Talbi 2021

Automated Design of Deep Neural Networks: A Survey and
Unified Taxonomy
EL-GHAZALI TALBI, University of Lille, CNRS, INRIA 34
In recent years, research in applying optimization approaches in the automatic design of deep neural net-
works has become increasingly popular. Although various approaches have been proposed, there is a lack of
a comprehensive survey and taxonomy on this hot research topic. In this article, we propose a unified way
to describe the various optimization algorithms that focus on common and important search components
of optimization algorithms: representation, objective function, constraints, initial solution(s), and variation
operators. In addition to large-scale search space, the problem is characterized by its variable mixed design
space, it is very expensive, and it has multiple blackbox objective functions. Hence, this unified methodol-
ogy has been extended to advanced optimization approaches, such as surrogate-based, multi-objective, and
parallel optimization.
CCS Concepts: • Computing methodologies → Search methodologies;
Additional Key Words and Phrases: Metaheuristics, machine learning, optimization, deep neural networks,
hyperparameter optimization, network architecture search
ACM Reference format:
El-Ghazali Talbi. 2021. Automated Design of Deep Neural Networks: A Survey and Unified Taxonomy. ACM
Comput. Surv. 54, 2, Article 34 (March 2021), 37 pages.
https://doi.org/10.1145/3439730
1 INTRODUCTION
In recent years, deep neural networks (DNNs) have enabled significant progress in many applica-
tion domains, including computer vision and natural language processing (NLP) [67]. The design
of DNNs has proven to be critical. Currently employed DNN architectures have mostly been devel-
oped manually by human experts, which is a time-consuming, error-prone process, and it prevents
finding new architectures that go beyond the human domain knowledge. Consequently, there is a
growing interest in automated neural architecture search and hyperparameters optimization (Au-
toDNN) [87]. It allows the design of more efficient and effective DNNs and more accessibility to
non expert for solving diverse learning tasks. AutoDNN approaches outperformed handcrafted ar-
chitectures for some learning tasks, such as image classification [150], object detection [219], and
semantic segmentation [32].
This work has been supported by PGMO project and University of Luxembourg.
Author’s address: El-Ghazali Talbi, University of Lille, CNRS, INRIA, Polytech’Lille, Cité Scientifique, Villeneuve d’Ascq,
France 59655; email: el-ghazali.talbi@univ-lille.fr.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from permissions@acm.org.
© 2021 Association for Computing Machinery.
0360-0300/2021/03-ART34 $15.00
https://doi.org/10.1145/3439730
ACM Computing Surveys, Vol. 54, No. 2, Article 34. Publication date: March 2021.
34:2 E.-G. Talbi
In the past 5 years, a lot of effort has been dedicated to automate the design of DNNs. Among
the crucial contributing aspects for this progress are the design of new deep neural architectures
and tuning of their associated hyperparameters. Scaling up DNNs capacity has been known as
an effective approach to improve model quality for several learning tasks. Exact optimization ap-
proaches cannot be applied to such NP-complete optimization problems. A wide variety of specific
heuristics and metaheuristics have been used for architecture and hyperparameter optimization:
random search [15, 107, 109, 157], grid search [177], Monte Carlo Tree Search (MCST) [138, 194],
reinforcement learning (RL) [7, 91], and many families of metaheuristics such as evolutionary al-
gorithms (EAs) and particle swarm optimization (PSO).
Some survey papers related to AutoDNN exist in the literature. Some papers focus on specific
optimization problems such as hyperparameter optimization [14, 60, 126] and neural network ar-
chitecture (NAS) [55, 200]. In Reference [55], the paper is structured according to three high-level
dimensions: search space, search strategy, and performance estimation strategy. Other survey pa-
pers focus on some families of optimization algorithms. In Reference [42], the authors provide
a survey of swarm and evolutionary computing approaches for general deep-learning problems.
Other surveys deal with neuroevolution [172] and reinforcement learning [91]. In Reference [62],
the authors propose a survey of metaheuristics for the training problem.
In this article, a survey of optimization algorithms for AutoDNN is presented. A unified way to
describe the optimization algorithms allow to focus on common and important search components
for all AutoDNN methodologies: representation of DNNs (i.e., search space), formulation of objec-
tive function(s), handling of constraints, initialization of solution(s), and the design of variation
operators (i.e., greedy such as RL, unary operators such as neighborhoods, mutation in EAs and
velocity update in PSO, binary operators such as crossover in EAs, and indirect search operators).
We also extend this unifying view to important optimization methodologies for AutoDNN dealing
with surrogate-based optimization (i.e., Bayesian optimization), multi-objective optimization, and
parallel optimization. A unified taxonomy is proposed in an attempt to provide a common termi-
nology and classification mechanisms. The goal of the general taxonomy given here is to provide
a mechanism to allow comparison between different optimization methodologies. In addition, it is
hoped that the categories and their relationships to each other have been chosen carefully enough
to indicate areas in need of future work as well as to help classify future work.
The article is structured as follows. In Section 2, the main concepts of DNN and metaheuristics
are detailed in a general and unified way. Section 3 formulates the problem and describes its main
characteristics. In Section 4, we present in a unified way the design of the various search compo-
nents of metaheuristics: DNN representation, objective function definition, constraint handling,
solution(s) initialization, and variation operators design (i.e., greedy, unary, N-ary and indirect
operators). In Section 5 (respectively, Section 6 and Section 7), we focus on important aspects in
AutoDNN dealing with surrogate-based optimization (respectively, multi-objective optimization,
parallel optimization). Finally, the Section 8 presents the main conclusions and identifies some
research perspectives.
2 MAIN CONCEPTS
This section provides an overview of the basic components of popular DNNs. Then, it presents in
a unified way the main common search concepts of metaheuristics.
2.1 Deep Neural Networks

DNNs are accurate and efficient learning approaches that represent one of the hottest research
areas in machine learning. DNNs are widely applied in computer vision, NLP, and robotics [67].
They are based on neural network architectures, which interconnect multiple processing layers
Automated Design of Deep Neural Networks: A Survey and Unified Taxonomy 34:3
[79]. DNNs automatically extract features from big unstructured data such as images, text, and
audio. They learn the mapping between the features and predicted classes, layer by layer, through
a transformation of the data, from low-level features to high-level features. This deep feature hi-
erarchy enables DNNs to perform high-performance accuracy in many learning tasks.
DNNs come in two major families: feed-forward and recurrent. In feed-forward DNNs, all the
operations are carried out as a sequence of operations on the outputs of previous layers. In such
DNNs, there is no memory. Feed-forward neural networks process information layer by layer,
while recurrent neural networks have feedback loops between layers allowing them to be used in
time-dependent tasks, such as NLP. One of the most popular feed-forward DNNs is convolutional
neural networks (CNNs). CNNs are composed of three main types of layers: convolutional layers,
pooling layers and fully connected (FC) layers. In general, the training is performed by gradient-
based algorithms (e.g., stochastic gradient descent). CNNs shows impressive results in computer
vision for image and video processing. Many handcrafted CNN architectures have been proposed
such as AlexNet [105], VGG [165], GoogLeNet [180], ResNet [76], and DenseNet [85]. Such DNNs
can be giant and include many layers of different types and millions of hyperparameters.
There are other feed-forward DNNs such as Deep Boltzmann machines (DBMs), Deep Belief net-
works (DBNs), Auto-Encoders (AEs), and Restricted Boltzmann Machines (RBMs). Various single-
layer unsupervised learning models have been proposed and stacked to build DNNs (e.g., sparse-
response RBM (SR-RBM), AE, and denoising AE (DAE)). RBM is a two-layers undirected graph,
composed of one visible layer and one hidden layer with no connections allowed between nodes of
the same layer [80]. An AE is a three-step DNN composed of an input layer, a hidden layer, and an
output layer. The number of units in the input layer is the same as the output layer. The encoder
is defined by the transformation from the input layer to the hidden layer, and extracts the features
from the input data. The decoder transforms the hidden layer to the output layer, and reconstructs
the input data from the features. DBN is a generative model consisting of multiple stacked RBMs
trained by contrastive divergence in a unsupervised way [81]. DBM is a network of symmetrically
coupled stochastic binary units, which contains a set of visible units. There are connections only
between hidden units in adjacent layers, as well as between the visible units and the hidden units
in the first hidden layer.
Recurrent neural networks (RNNs) are specifically designed for time-dependant problems. They
have both feedback and feedforward connections. RNNs have internal memory to allow long-
term dependencies, which will affect the output. Some intermediate nodes compute values that
are stored internally in the DNN. Those internal values are used as inputs to other operations in
conjunction with the processing of a later input. Long Short-term Memory networks (LSTMs) are
the most popular variant of RNNs capable of capturing long-term time dependencies [83].
2.2 Metaheuristics
The AutoDNN problem consists in searching the optimal DNN a ∗ from a set of possible solutions A,
which maximizes an objective function f (a), while satisfying a set of constraints. The search space
A is derived from the representation used to encode DNNs. Metaheuristics represent a class of
general-purpose heuristic algorithms that can be applied to any optimization problem [182]. Unlike
exact methods, metaheuristics allow us to tackle large-scale problems by delivering satisfactory
solutions in a reasonable time. There are two main families of metaheuristics that can be applied
to the AutoDNN problem: single-solution and population-based metaheuristics.
2.2.1 Single-solution-based Metaheuristics. Single-solution-based metaheuristics (S-metaheur-
istics) improve a single DNN. They could be seen as “walks” through neighborhoods or search
trajectories through the search space [182]. S-metaheuristics iteratively apply the generation and
34:4 E.-G. Talbi
replacement procedures from the current single DNN. In the generation phase, a set of candidate
DNNs are generated from the current solution a. This set C (a) is generally obtained by local trans-
formations of the solution. In the replacement phase,1 a selection is performed from the candidate

solution set C (s) to replace the current DNN, i.e., a solution a ∈ C (a) is selected to be the new DNN.
Popular examples of such S-metaheuristics are local search (i.e., gradient-based algorithms), sim-
ulated annealing and tabu search. In addition to the representation of DNNs, their common search
concepts are the definition of the neighborhood structure and the generation of the initial solution.
2.2.2 Population-based Metaheuristics. Population-based metaheuristics (P-metaheuristics)

could be viewed as an iterative improvement of a population of DNNs. P-metaheuristics start from
an initial population of DNNs.2 Then, they iteratively generate a new population of DNNs using
variation operators. Popular examples of P-metaheuristics are evolutionary algorithms (EAs) [44],
ant colony optimization (ACO) [50], particle swarm optimization (PSO) [140], and estimation of
distribution algorithms (EDA) [106].
P-metaheuristics may be classified into two main categories:
• Evolutionary-based: In this category of P-metaheuristics, the DNNs composing the popu-
lation are selected and reproduced using variation operators (e.g., mutation, crossover) act-
ing directly on their representations. A new DNN is constructed from the different features
of solutions belonging to the current population. Evolutionary algorithms (e.g., differential
evolution (DE), evolution strategy (ES), and genetic programming (GP)) represent the most
popular examples of this class of P-metaheuristics.
• Blackboard-based: In the blackboard model, a shared knowledge is iteratively updated
by a diverse group of agents [56]. Here, the solutions of the population participate in
the construction of a shared knowledge. This shared knowledge will be the main input
in generating the new population of DNNs. ACO and EDAs are examples of this class of
P-metaheuristics. For the former, the shared knowledge is represented by the pheromone
matrix, while in the latter strategy, it is represented by a probabilistic learning model.
3 PROBLEM FORMULATION
Three popular formulations of the target optimization problem have been widely investigated in
the literature:
• Neural architectures search (NAS): the goal is to search the optimal network topology
(e.g., number of layers, types of operations, connections between operations) [55]. The hy-
perparameters are supposed to be a priori fixed and/or optimized in an independent post-
processing search process.
• Hyperparameter optimization (HPO): this formulation requires an a priori definition of
the DNN architecture. It consists in fixing the various hyperparameters of the DNN [60].
There are two types of hyperparameters: (1) operation hyperparameters that characterize
the features associated to operations. For instance, the features of a convolution operation
can be the filter size (width, height) and the stride size (width, height); (2) optimization
hyperparameters that characterize the global features of the DNN. An example of global
features are the learning parameters (e.g., learning rate schedules, momentum, batch size)
and regularization parameters (e.g., weight decay, dropout rates).
1 Also named transition rule, pivoting rule, and selection strategy.

2 Some P-metaheuristics, such as ant colony optimization, start from partial or empty solutions.
• Joint optimization (NAS+HPO): the NAS and HPO optimization problems interact in a
way that can render this separation suboptimal. In the AutoDNN joint optimization formu-
lation, the two problems are solved in a joint manner. Neuroevolution (e.g., NEAT [173])
was a popular approach to solve the AutoDNN problem, where both the architecture and
the hyperparameters are optimized in a global way [135]. An important question is related
to the level (i.e., architecture or hyperparameters) in which optimization is carried out at
each iteration. Three strategies can be applied: (1) Global optimization, which consists in
optimizing all levels at the same time [152, 172]; (2) Nested optimization, which consists
in optimizing the different levels in a hierarchical way. At each iteration, the architecture
is optimized, then the hyperparameters for this given architecture are optimized [149]; (3)
Sequential optimization, where the NAS problem is solved first. Then, the hyperparameters
for the obtained final solution are optimized. It is based on a similar concept as the nested
optimization, with the main difference being that the optimization is performed only once
for both the architecture and the hyperparameters.
Let us formulate the general AutoDNN joint optimization problem (NAS+HPO). A DNN a can
be defined by the quadruplet a = (V , E, λV , λa ) where V is a set of nodes denoting the layers (i.e
operations) of the DNN, E is a set of edges (i.e., connections) between operations, λV is the feature
set of operations and λa is the optimization features set of the DNN. The induced graph G = (V , E)
defines the topology of the DNN. Each node has one of k labels, representing the corresponding
operations. The space grows exponentially in both |V | and k. Given the space of all datasets D, the
space of all deep-learning models M, and the search space of architectures A, the optimal DNN
consists to optimize the following objective function: Θ : A × D −→ M. Let d be a given input
dataset, in which dt r ain represents the training set and dval id represents the validation set. The
deep-learning algorithm Θ estimates the model ma ∈ Ma by minimizing,
Θ(a, d ) = arд minma ∈Ma L(ma , dt r ain ),
where L represents the loss function. The problem consists in finding the optimal DNN a ∗ maxi-
mizing the objective function f using the validation data,
a ∗ = arд max a ∈A f (Θ(a, dt r ain ), dval id ) = arд max a ∈A f (a),
where the objective function f can be defined as the negative loss function L, which measures the
accuracy. The most popular loss functions are RMSE (respectively, cross-entropy) for regression
(respectively, multi-class classification) problems [18].
The NAS and HPO can be seen as a reduced AutoDNN problem. Given an DNN topology de-
fined by the graph G, the hyperparameter optimization problem (HPO) consists to find its optimal
hyperparameter configuration: λ∗ = (λV , λa ) ∗ = arд max λ ∈Λ f (a, λ), where Λ represents the set of
all possible values for the hyperparameters, and a is the DNN induced by G. The NAS problem
can be formulated as finding an architecture x ∗ when all architectures are evaluated under a priori
fixed hyperparameter choices: x ∗ = arд max x ∈G = f (x, λ∗ ). The organization of the article and the
proposed taxonomy can be applied for both HPO, NAS, and HPO+NAS problems.
The AutoDNN problem is characterized by the following important properties:
• Large-scale optimization problem: A DNN could be composed of millions of decision
variables. State-of-the-art DNNs have more than hundreds of layers [85] and thousands of
hyperparameters [86].
• Mixed optimization problem: Three different types of decision variables arize in Au-
toDNN: continuous, discrete ordinal and discrete categorical. Continuous variables refer to
real numbers defined within a given interval (e.g., learning rate, momentum). Discrete ordi-
nal (i.e., quantitative) variables are related to measurable integer values. Typical examples
34:6 E.-G. Talbi
Fig. 1. A unified view of problem landscape and search components for AutoDNN metaheuristics and chal-
lenging optimization issues.
are the size of the filter and the stride in CNN pooling operations. Categorical (i.e., quali-
tative) variables are non-relaxable variables defined within a finite set of choices (e.g., type
of operations, training optimizer). It is important to notice that different types of variables
will require different optimization approaches.
• Variable-size design space: The search space of the AutoDNN problem contains condi-
tionality. A decision variable is relevant only if another variable (or some combinations)
takes a certain value. For instance, the number of layers influences the number of per-layer
hyperparameters; the type of operation will induce a different number and type of features
variables. The search space of the problem varies dynamically during the optimization pro-
cess as a function of some variables values [144].
• Extremely expensive black-box objective function(s): The problem has very expensive
objective function(s) that consist in training the whole DNN and computing the quality of
the network (e.g., loss function). When facing very large-scale datasets the learning might
take several hours, days or even months. Morever, the black-box objective function do not
give access to a gradient or the Hessian, and do not have properties such as convexity and
smoothness, which are used in classical optimization.
• Multi-objective optimization problem: The AutoDNN problem can be formulated as a
multi-objective problem in which many different and conflicting objectives are optimized.
Indeed, in addition to maximizing the accuracy, some objectives dealing with cost, size,
energy consumption, inference time of a DNN may be taken into account.
4 SEARCH COMPONENTS
Our survey is based on a unifying view of optimization algorithms according to their main search
components. The most important and common search components in all metaheuristics are the
problem landscape and the search operators (Figure 1). The problem landscape is defined by the
encodings of solutions, which induces the search space, the definition of the objective function(s),
and handling of the constraints. The search operators are mainly the initialization of solution(s)
and the design of variation operators. Three important optimization challenges are highlighted
for the AutoDNN problem: surrogate-based, multi-objective, and parallel optimization. According
to the addressed scientific challenges, some examples are given from both HPO, NAS, HPO+NAS
problems.
4.1 Representation of DNNs

Designing any AutoDNN metaheuristic needs an encoding (i.e., representation) of a solution. The
encoding plays a major role in the efficiency and effectiveness of any metaheuristic and then
Fig. 2. Classification of the different encodings of DNNs.
Fig. 3. Chain-structured DNNs. Different colors define different operations. For a CNN they represent unary
operations such as convolutions, pooling, activation functions, or multivariate operations such as concate-
nation or summation.
constitutes an essential step in designing an AutoDNN metaheuristic. This encoding defines the
search space associated to the problem. Many alternative representations have been used in the
literature [182] (Figure 2):
• Direct representations: The encoding specifies a complete DNN. It describes completely
the architecture and the hyperparameters associated to the DNN.
• Indirect representations: The representation does not encode a complete DNN. A decoder
(e.g., rules, greedy procedure, recurrent neural network) is required to generate the DNN
given by the encoding. The encoding/decoding procedures may be deterministic or non
deterministic.
A complete encoding is a one-to-one mapping and must represent the whole information of a
DNN a defined by a = (V , E, λV , λa ) (Figure 3). On the one hand, the encoding must specify the
architecture of the DNN, which is defined by the graph G = (V , E). Hence, the number of oper-
ations, type of operations, and connections between operations must be given. A general graph
G (e.g., RNNs) can be represented by its binary adjacency matrix, while a DAG (e.g., CNNs) can
34:8 E.-G. Talbi
Fig. 4. Skip connected and multiple edges chain-structured DNNs.
be represented by a lower triangular binary matrix. Indeed, a DAG can be encoded so that all the
directed edges connects nodes from a lower number to a higher number [122, 152]. On the other
hand, the encoding must represent the features of all active operations (e.g., number of filters, size
of filters and strides for a convolution), and the global features of the DNN (e.g., learning rate). The
main property of features encodings are their variable and mixed nature: continuous (e.g., learning
rate), ordinal discrete (e.g., number of layers) and categorical discrete (e.g., type of operations).
4.1.1 Direct Representations. Two main families of DNNs may be found in the literature: flat
DNNs and hierarchical DNNs.
Flat DNNs: DNNs are generally defined as flat networks (e.g., DBN, some CNNs). The most
simple and popular flat network is the chain-structured (Figure 3) [219]. Hence, the topology as-
sociated to DNNs can be represented by DAG (Directed Acyclic Graphs) G = (V , E), where each
node v ∈ V represents an operation (i.e., layer), and each edge e represents a feature map con-
necting two operations. Let us notice Ii the set of input edges associated to an operation vi . The
computation of the output edge O i is: O i = vi (Ii ). The network can be represented by a sequence
of operations such that any operation vi receives its input from operation vi−1 : O i = vi (O i−1 ) [67].
An example of such popular DNNs are VGGNet [165] and AlexNet [105].
Extended flat DNNs include skip connections, highway connections and multiple edges between
operations (Figure 4) [21, 26, 54, 151, 218]. Hence, the incident edges of an operation vi is the union
of O i−1 and other ancestor edges: O i−1 ∪ O j /j < i − 1. Those topologies enable more flexibility in
designing DNNs. Residual networks (ResNet) [76] (respectively, DenseNets networks (DenseNet)
[85]) belongs to this family of architectures, in which the previous operations are summed (re-
spectively, concatenated).
Many different encodings have been used to represent flat DNNs. Linear representations en-
coded by string of symbols of a given alphabet are widely used. DBN networks are generally
represented by linear encodings, which include topological parameters such as the number of hid-
den layers and neurons by hidden layer, and some global features for the contrastive divergence
(e.g., weight cost) and back-propagation (e.g., learning rates for weights and biases). For CNNs,
the presence of many conditioned variables makes that the encoding in intrinsically of variable-
length. For instance, the topology (respectively, hyperparameters) is conditioned by the number
of layers (respectively, type of operation). However, many authors use fixed-length encodings by
assuming some restrictions. In HPO optimization, the architecture (i.e., graph G = (V , E)) is a pri-
ori fixed. Then, a fixed-length mixed linear encoding is mostly used to represent the operations
features lambdav and global features lambdaa of DNNs (e.g., chain-structured architectures [61,
218]). In NAS and HPO optimization, a fixed-length encoding still possible when the number of
operations (i.e., layers) is fixed [191]. Compared to the previous encoding, it will include the set of
operations and their connections [128]. In Reference [7], the type of operations (e.g., convolution,
pooling, fully connected, global average pooling), and hyperparameter settings (e.g., number of
filters, kernel size, stride and pooling size) are considered in a linear fixed-length encoding. When
the number of layers is bounded by a maximal value, the use a fixed-length encoding can also be
an alternative. In Reference [121], the proposed fixed-length mixed encoding includes the num-
ber of layers (ordinal discrete), learning rate (continuous), type of activation function (categorical
discrete) and the gradient algorithm used in training (categorical discrete).
Variable-length encodings is another suited alternative to encode flat DNNs. In References [98,
178], a variable length encoding is used to solve the AutoCNN problem. The encoding represents
different numbers of convolutional layers and pooling layers, and their hyperparameters. In Ref-
erence [5], a variable-length sequence of layers and their respective hyperparameters is used to
solve the AutoCNN problem. The encoding represents the general structure of the network (i.e.,
sequence of layers) and the hyperparameters associated to each layer using a human-readable
context-free grammar. In Reference [189], the encoding is inspired from IP address in computer
networks to represent a variable length encoding of CNNs. An IP address is represented by se-
quence of decimal numbers delimited by full stops (e.g., 192.159.1.354). The network is encoded
by k IP addresses where k is the maximum number of layers. Each layer is represented by an IP
address, and non used layers are disabled.
Nonlinear encodings such as grammars, Cartesien Genetic Programming (CGP) [131, 132, 187],
and tree structures [138, 148] have also been used to encode flat DNNs.
Hierarchical DNNs: In recent years, a widely used network type to tackle the scalability issue
in DNNs is hierarchical networks [116]. They allow to reduce the search space, integrate human
knowledge in the definition of the building blocks, and can be more flexible to solve other learning
tasks (i.e., transfer learning) [219]. Compared to flat networks, they have smaller degree of freedom
in the architecture design. In hierarchical DNNs, the main idea is to have several blocks3 that are
used as building blocks in the design of multi-level DNNs. Many popular hierarchical network have
been handcrafted, including ResNet [76] and DenseNet [85]. Cell-based CNN architectures [219],
inception and xception networks [181] represent the most popular hierarchical DNNs. Except a
three-level model proposed in Reference [13], most of the hierarchical DNNs are composed of two
levels. The inner-level represents the set of primitive buiding blocks, while the outer-level contains
the full architecture, which is a composition of the building blocks. Depending on the optimized
level, different encodings have been proposed:
• Inner-level optimization: In inner-level optimization, the outer-level is fixed while the
inner-level is optimized. The topology of the DNN at the outer-level is a priori fixed (e.g.,
manually). Then, the problem consists in finding the optimal inner-level blocks. In Refer-
ences [125, 184, 202, 213], each inner-level block can be composed of a given number of
layers n. Let k be the number of possible configurations for a layer. Then, the size of the
inner-level search space will be (k × (n − 1)!)b , where b is the number of blocks. In Refer-
ence [198], path encoding is proposed in which they represent the set of directed paths of

a cell. The total number of paths is exponential in n: ni=0 k i while the adjacency matrix
scales quadratically.
Many proposed encodings are many-to-one mappings [27], in which many encodings can
represent the same DNN, and then duplicate evaluations are possible. In References [125,
202], a hierarchical chained structured DNN is proposed. The outer-level is considered
as a sequence of a given number of S connected stages Bs , s = 1, . . . , S (Figure 5). The
3 Also called patterns, modules, stages, segments, motifs, and cells.
34:10 E.-G. Talbi
Fig. 5. Inner-level optimization of DNNs. Template-based hierarchical DNNs chained architectures using
summation as merging operation. Only the inner-level segments have to be configured. Each segment is
defined as a set of predefined operations. The encoding represents the connections between the nodes.
hyperparameters of the stages are fixed. The search space is related to the configura-
tion of inner-level segments. Each segment is defined as a set of n maximal predefined
operations Bs,i , s = 1, . . . , S & i = 1, . . . , n such as convolution, pooling layers and batch
normalization. The proposed encoding is based on a fixed-length binary vector (i.e., size
of n × n − 1 ÷ 2), which represents the connections between the nodes. This encoding is
flexible enough so that many well-known hand-crafted DNNs can be represented such as
VGGnet, ResNet and DenseNet. This encoding is a many-to-one mapping, and induces a
search space of size Λ = S × 2n(n−1)÷2 .
In Reference [219], the authors optimize two different types of inner-level cells: normal
cells, which preserves the dimension of the input, and reduced cells, which reduces the
dimension of the input. The outer level, which defines the overall architecture of the
convolutional nets, is manually predetermined. In Reference [213], a DNA-based encoding
is proposed. A DNN is defined as a fixed-length sequence of inner-level blocks. Each block
is composed of a set of convolution layers with a given maximal number of layers. For each
convolution layer, some hyperparameters have been encoded such as the number of filters
and their size. In Reference [184], the encoding is represented by connecting segments.
Each segment has repeating patterns of operations, and is parameterized by the operation
type and the number of repetitions of the patterns. Each pattern is a sequence of connected
operations.
• Outer-level optimization: In outer-level optimization, the inner-level is fixed while the
outer-level is optimized. This methodology is widely used in cell-based architectures. Cell-
based CNNs are designed by a combination of a priori-defined inner-level cells. The different
types of used cells are predefined. Then, the problem consists in finding the optimal archi-
tecture at the outer-level using this fixed set of inner-level cells [49, 181, 197, 215] (Figure 6).
Because the structures of inner-level cells are fixed, one only needs to optimize the graphs
of cells. Various encodings are used for DNNs in outer-level optimization such as a chain-
strutured [49] and multi-branch chained network [26, 128, 197].
• All levels optimization: Some approaches perform the search in both levels: the outer-
level (i.e., macro-architecture of DNN) and the inner-level (i.e., micro-architecture of blocks)
[113, 216]. In Reference [113], the authors propose a trellis-like network level search space
that augments the block level search space to form a hierarchical architecture search space.
To reduce the complexity of the search, continuous relaxation of discrete variables is per-
formed to be optimized by a gradient algorithm.
This idea of relaxing discrete representations into continuous ones has been explored in many
flat and hierarchical DNNs allowing the application of gradient-based optimization algorithms [2,
114, 117, 156, 162, 188, 211]. In References [35, 117], each operation is encoded by a mixture of
Fig. 6. Outer-level optimization of DNNs. Two different inner-level cells are used: cell1 and cell2. The final
outer-level architecture is optimized by finding the optimal sequence (e.g., cell1, cell2, cell1).
candidate operations, where the operations mixing weights are parameterized by a continuous
vector. Then, the categorical choice of a given operation is reduced to a Softmax over all possible
operations.
4.1.2 Indirect Representations. Direct encodings represent strong specification schemes that
may have a problem with scalability. They require longer encodings as DNN size increases, and
search space will be increased accordingly. Indirect encoding allows a more compact representa-
tion in which the DNN is not completely specified in the representation, although they can be
derived from it. Instead, a decoding strategy (e.g., greedy algorithm, set of rules, recurrent neural
network) is used to decode the generated DNNs. In the case of a non deterministic decoding, in-
direct encodings represent a one-to-many mapping. For the sake of efficiency, we need to be sure
that indirect encodings do not restrict DNNs to some suboptimal class of DNNs [70]. The most
popular indirect representation is based on continuous encodings.
DNNs to/from continuous space: Many research works propose to map discrete architectures
of DNNs to a continuous space. The original DNN architecture a is mapped into continuous rep-
resentation ϵ using an encoding function E : A −→ ϵ [3, 28, 127]. In general, the encoder embeds
DNN architectures into a continuous space through the use of RNNs. The performance predictor
f : ϵ −→ R maps the continuous representation of an architecture into its performance. Then E (a)
is optimized into E(a ) using a continuous optimization algorithm. Afterwards E(a ) is transformed
back to a new architecture a using a decoder. The RNN decoder is responsible for decoding a ,
taking E (a ) as input. The kernel function based on the learned embedding space has to be defined.
In Reference [127], a single-layer vanilla LSTM is the basic model of the encoder/decoder and the
hidden states of the LSTM are used as the continuous representation of a. The optimization al-
gorithm is based on a gradient descent, and the encoder/decoder are trained by minimizing the
combination of performance prediction loss and structure reconstruction loss.
4.2 Performance Estimation of DNNs

The objective function f formulates the goal to achieve. It associates to each DNN a real value,
which describes its quality f : A −→ R. The classical way to evaluate the performance of a DNN
a is to train it on training data and compute its performance on validation data. Accuracy on
unseen data is the most used metric to assess the performance of the learned model. The most
34:12 E.-G. Talbi
Fig. 7. Main approaches for speeding up the computation of the objective function.
time-consuming part of the optimization process is the training of the DNN. Speeding up the
training process is widely used to reduce the computational cost. Low-fidelity optimization allows
to reduce the computational cost by using less expensive objective functions [201]. While these
low-fidelity estimations reduce the computational cost, they also introduce bias in the estimate as
performance will typically be under-estimated. An important property of low-fidelity procedures
is that the relative ranking of architectures remain the same [158]. The main families of the ap-
proaches allowing to speedup the computation of the objective function can be classified as follows
(Figure 7):
• Inheritance: This approach avoid the training from scratch and thereby substantially re-
duces the required training time per DNN. It is based on knowledge transferring between
DNNs. Weight sharing is a well-known approach in which we initialize the weights of the
generated DNNs based on weights of already trained DNNs [25, 146]. Hence, a DNN can be
transformed while leaving some weights unchanged [53, 88, 94, 195]. Instead of a random
initialization of the weights, informed decisions (e.g., Xavier initialization) [64] have also
been used. Pre-trained weights using transfer learning also allows to reduce the huge cost
of training DNNs from scratch [196]. Another popular inheritance-based approach is net-
work morphisms [195]. In the context of DNNs, network morphism refers to a parameter-
transferring map from a given DNN to a generated DNN that preserves its function and
outputs [54]. Morphing types are demonstrated including depth morphing [33], width mor-
phing, kernel size morphing, and subnet morphing [195].
One-shot architectures: represent another interesting approach for weight sharing. The
main motivation is that instead of training hundreds of different DNNs from scratch, one can
train a single large network capable of generating any DNN architecture in the search space.
All architectures are treated as different subgraphs of a supergraph and shares weights
between architectures that have edges of this supergraph in common [117] (Figure 8). The
one-shot architecture4 search consists of four steps [13]: (1) Define a search space to encode
a wide variety of DNNs using a single one-shot model. (2) Train the one-shot single model to
find the weights. (3) Evaluate generated architectures (i.e., subgraphs of the one-shot model)
on the validation set using the pre-trained one shot model. (4) Re-train the best found DNNs
from scratch and assess their performance on the test set. Decoding one-shot architectures
are generally based on sampling independently from a fixed probability distribution. In
Reference [21], a random search is applied. The drawback of one-shot architectures is that
their associated supergraph restricts the search space to its subgraphs [158].
• Reduced training: This low-fidelity approach in training consists in reducing the training
time [209], the number of epochs [178, 217], or the input dataset [103]. For example, one
4 Also called Hypernetworks.
Fig. 8. Example of one-shot DNN cell architecture. It is composed of five separate operations. By sampling,
we can select the two conv 3×3 operations path. To be evaluated, the network will not retrain the weights.
can carry out search on CIFAR-10 and “transfer” the obtained DNN (with some changes,
e.g., changing the number of filters) to ImageNet [112, 203]. Quantization approaches rep-
resent weights using a small set of permitted values, reducing the number of bits required
to store each weight. In References [21, 40], the weights takes binary values, and then the
complexity of multiplications operations will be reduced during training. Existing quanti-
zation methods can be mainly divided into two categories. The first category of methods
seeks to design more effective optimization algorithms to find better local minima for quan-
tized weights. For instance, these works introduce knowledge distillation [153]. The second
category focus on improving the quantization function (e.g., binarization).
• Surrogate5 : an alternative to reduce the complexity of the objective function is the use
of a surrogate. Surrogate models replace expensive objectives with models that provide
an approximation. Section 5 details this important class of optimization approaches, named
surrogate-based optimization.6 In Reference [63], the idea of weight agnostic DNNs has been
proposed, where there is no use of any explicit weight training in the optimization process.
They are supposed to have strong inductive biases that can already perform various tasks
with random weights.
• Downscaled models: Many strategies consist in using downscaled models. Reduction
can be applied to data and network. In data reduction, a partial dataset is used instead
of the whole dataset [189]. Downsampling techniques (e.g., lanczos, nearest, bilinear,
bicubic, hamming, box) have also been used to reduce the resolution of images [36]. In
network reduction, downscaled models are using a subset of the network for training. In
References [209, 219], reduced architectures with less filters per layer and less cells have
been trained.
• Learning curve extrapolation: It describes two different strategies: time-based and data-
based. In time-based learning curve extrapolation, the performance of the training pro-
cedure is learned function from its number of iterations or training time [213]. Different
learning models have been investigated such as logistic regression [179], neural networks
5 Also known as meta-model, reduced model, and approximation.

6 Also called Bayesian optimization.
34:14 E.-G. Talbi
[103], support vector machines regression [8], linear regression [8], random forest [8], and
recurrent neural network (e.g., LSTM) [112]. In Reference [48], the learning curve model
is used to terminate training of DNNs when it is unlikely to beat the performance of the
best found DNN. In data-based learning curve extrapolation, the performance of the train-
ing procedure is learned function of the size of the available dataset for training. In Refer-
ence [148], a training is carried out for a few epochs, and then meta-learner network (e.g.,
RNN) predicts the performance a fully trained network would have.
Low-fidelity approaches can also help to avoid overfitting. When using low-fidelity approaches,
full training are generally applied at the end of the optimization for the best found DNNs [191].
Other adaptive approches wich gradual increase in fidelities during the search have been investi-
gated [57, 108].
4.3 Constraint Handling

Many constraints characterize the AutoDNN problem, such as the number of layers, model com-
plexity, computation cost, memory consumption cost, training time, prediction time, and energy
usage [186]. The constraints may be of any kind: linear or nonlinear, equality, or inequality con-
straints. The proposed constraint handling techniques can be classified as:
• Reject: Reject strategies represent a simple approach, where only feasible solutions are kept
during the optimization process and then infeasible solutions are automatically discarded
[49, 84]. This kind of strategies is conceivable if the portion of infeasible solutions of the
search space is very small. Moreover, they do not exploit any information on infeasible
solutions.
• Penalizing: In penalizing strategies, infeasible solutions are considered during the search
process. The objective function is extended by a penalty function, which will penalizes in-

feasible solutions using for instance linear penalization f (a) = f (a) + λc (a), where c repre-
sents the penalty function (e.g., number of violated contraints, amount of infeasibility) and
λ the weighting factor (e.g., static, dynamic, adaptive). This is the most popular approach
in which many alternatives have been used to define the penalties [184, 188, 216, 218].
• Repairing: Repairing strategies consist in heuristic algorithms transforming an infeasi-
ble solution into a feasible one. A repairing procedure is applied to infeasible solutions to
generate feasible ones [75, 185]. Those strategies are applied in the case where the search
operators may generate infeasible solutions.
• Preserving: In preserving strategies, the encoding and variation operators will insure the
generation of feasible solutions. They incorporate problem-specific knowledge into the en-
coding and search operators to generate only feasible solutions and then preserve the fea-
sibility of solutions. Incorporating prior knowledge about typical properties and allowable
structures of DNNs can reduce the size of the search space and then simplify the search. One
can find the following constraints: maximum number of layers, possible types and number
of operations [178], starting and ending layers [7, 189, 202], possible connections [207], and
fixed sequence of operations [125].
4.4 Initial Solution(s)

The generation of initial solutions(s) has a great impact on the effeciency and effectiveness of
metaheuristics. For a single solution initialization, there is always a tradeoff between the use of
random and “good” initial solutions in terms of the quality of solutions and computational time. In
the initialization of a population of solutions, an additional criterion to deal with is diversification.
If the initial population is not well diversified, then a premature convergence can occur. Many
approaches have been developed for the AutoDNN problem:
• Random generation: Most iterative metaheuristics approaches initialize solution(s) in a
random way (e.g., Gaussian distribution, uniform): EAs [202], PSO [178], DE [190], gradient
[117, 127].
• Heuristic generation: Initial solutions can also generated by low-cost heuristics. In gen-
eral, greedy algorithms (e.g., reinforcement learning) are used for their effectiveness. In Ref-
erence [82], the authors suggest using a lower-dimensional search space to quickly identify
promising areas (e.g., reducing the resolution of images). This information can then be used
to initialize the metaheuristic for the original, higher-dimensional search space.
• Partial architectures: The optimization process starts with reduced small architectures
[23, 63, 122], or well-known skeleton architectures, and tries to augment them. In Refer-
ence [61] (respectively, Reference [191]), the VGGNet (respectively, DenseNet) skeleton is
used. Some metaheuristics start with poor trivial architectures and tries to improve them
by fixing for instance the number of layers and connections [168, 175, 218], and reducing
the type of operations [152]. This approach does not avoid the additional bias introduced
by the skeletons.
• Complete architectures: Some work propose initial solutions based on prior-knowledge
hand-crafted architectures [125] and/or best known DNNs [97]. Other works start with
giant DNNs to be compressed (i.e., dropout, swapout, subgraph search) for various learning
tasks [31, 59, 146, 188, 211]. This approach adds an additional bias introduced by the used
DNN.
• Mixed initialization: For a better compromise between diversification and quality, mixed
strategies may be applied. In Reference [116], a combination between random DNNs and
trivial DNNs (e.g., chain of operations) is developed.
• Diversified population: To our knowledge, there is no work dealing explicitely with diver-
sifying an initial population of DNNs using for instance sequential or parallel diversification
strategies [182].
4.5 Variation Operators

The role of variation operators is the generation of new DNNs during the search process. The
main variation operators found in optimization approaches can be classified as greedy procedures,
unary, n-ary and indirect.
4.5.1 Constructive Procedures. Constructive (i.e., greedy) procedures start from an elementary
or null DNNs and construct a complete DNN by adding operations and connections until a DNN
that is capable of solving the task emerges [142]. Very few greedy algorithms have been proposed
for the AutoDNN problem [122]. Sequential learning (i.e., Markov decision process) approaches
such as RL can be considered belonging to this family of optimization algorithms. In RL approaches,
an agent is trained to select the operation of a DNN in a particular order. The generation of a DNN
architecture is carried out by the agent’s action, in which the reward is based on an estimate of
the performance of the trained architecture on unseen data. The main questions in designing a
RL are: how do they represent the agent’s policy and how do they optimize it. In Reference [7]
the MetaQNN method uses ϵ-greedy Q-learning to train a policy, which sequentially chooses a
layer’s type (e.g convolution, pooling, and fully connected layers), connections between layers,
and corresponding hyperparameters. This approach has been generalized to hierarchical DNNs,
in which a block is repeated to construct a network [214]. In Reference [218], a policy gradient is
applied to approximate the reward function. The author uses recurrent neural network (RNN) policy
34:16 E.-G. Talbi
Fig. 9. Unary variation operators at different levels of a DNN.
to sequentially constructs the DNN architecture. This method has been extended in the state-of-
the-art NASNet approach [219], which constructs repeated blocks composed of convolution and
pooling operations. Multi-armed bandits approaches have also been investigated [108].
In contrast, pruning procedures start from a complete DNN and at each iteration reduce the
complexity of the network by removing nodes or connections [137], in the hope to improve the
generalization of the DNN. A large variety of DNN pruning approaches have been proposed us-
ing different pruning criteria [19]. In the “brain damage” approach [115], the authors remove the
redundant parameters using derivate-related criteria. In Reference [20], the weights are repre-
sented as Gaussian random variables and weights with lower mean value and larger uncertainty
are pruned. In Reference [176], the Hebbian rule is used as a pruning criterion, where more connec-
tions between weekly correlated neurons are skipped. The connecting weights can also be skipped
by regularization terms such as the squared l 2 norm and l 0 − norm [38].
4.5.2 Unary Operators. Unary variation operators transform a single DNN into another one. In
general, it represents a small change (i.e., perturbation) of a DNN. Some important properties must
be taken into account in the design of unary operators: ergodicity, validity, and locality [182]. The
most popular unary operators in metaheuristics are neighborhood in S-metaheuristics (e.g., gradi-
ent) and mutation in EAs. The design of unary operators depends on the type of representations.
For graph-based encodings, it consists in adding/deleting a node or a connection of the graph.
In discrete representations, it generally consists in changing the value associated to an element
by another value. For continuous variables, the most used class of unary operators has the form

x = x + M, where M is a random variable that takes different forms (e.g., uniform random, Gauss-
ian distribution). Unary operators have been applied to all levels of DNNs encodings (Figure 9):
• Architecture: Unary operators at this level consists to update a DAG using for instance
the following operations: add a layer, delete layer, change type of a layer, add a connection,
and remove a connection. Those unary operators have been used in different optimization
frameworks:
— Neighborhoods in S-metaheuristics: In some papers, the authors have relaxed the dis-
crete encoding of DNN into continuous encodings to enable gradient-based optimization
[2, 116, 117, 127, 162]. Hence, gradient-based optimization is applied using classical neigh-
borhoods of continuous variables. A popular gradient-based approach is the differentiable
search approach (i.e., DART) [117]. An indirect encoding is based on a continuous relax-
ation (i.e., mixing probabilities) of the architecture representation (i.e., NAS optimization
problem). The idea of solving the NAS problem within a continuous domain has been
also used in References [2, 156, 163, 188]. The proposed representation is not restricted
to any specific DNN architecture family, and is applicable to both CNNs and recurrent
networks. The current decoding scheme may suffer from discrepancies between the con-
tinuous representation and the induced DNN architecture.
— Mutation in EAs: In flat networks, many mutation operators have been designed. Dis-
crete mutations have been used in DAG representations of CNNs to connect or disconnect
two operations [125, 202], to add a layer, remove a layer [5, 54, 128, 135, 172], replicate a
layer [4]. Continuous mutations have been applied in Reference [164] into a CMA-ES al-
gorithm by relaxing the binary adjacency matrix into continuous matrix and using round-
ing operations. In Reference [148], tree-based mutations have been designed for LSTMs:
(1) mutation to randomly replace an operation with an operation of the same family,
(2) mutation to randomly inserts a new branch at a random position in the tree, and
(3) mutation to shrink the tree by choosing a branch randomly. For hierarchical DNNs,
the same unary operators can be applied at any level of the architecture hierarchy.
• Hyperparameters: Global and operations features of a DNN are generally encoded by a
mixed vector of continuous and discrete values:
— Neighborhood in S-Metaheuristics: Continuous neighborhoods [2, 162] and mixed
neighborhoods [54, 167] have been designed to be used in local search algorithms.
— Mutation in EAs: Discrete mutations have been used in different EA frameworks. In
Reference [175], the (1 + λ)-ES is applied in which λ solutions are generated by discrete-
based random uniform mutation on hyperparameters (e.g., number of filters, size of
filters). In Reference [97], discrete categorical mutations are applied in designing LSTMs,
such as changing the element-wise operation and the activation function. Continuous
mutations have been defined in a DE algorithm [190] and a CMA-ES algorithm [124].
In Reference [124], all discrete and continuous variables are scaled to be in [0, 1], on
which samples λ candidate DNNs are generated from a multivariate normal distribu-
tion. Mixed mutation operators have also been defined for global (e.g., learning rate,
training optimizer) and operations hyperparameters (e.g., activation function, filter size)
[54, 121, 152].
For hierarchical DNNs, the level in which unary opeators are applied can be sampled randomly. In
Reference [116], the authors sample the level k, the building block m at the level k, then a unary
operator is applied to an element of this building block m.
4.5.3 N-ary Operators. Unlike unary operators, n-ary variation operators recombine a set of n
DNNs into another one: A × A . . . × A −→ A. Their role is to inherit the building blocks of a set of
DNNs to generate new DNNs. The most popular n-ary operators in metaheuristics are crossover
in EAs and velocity update in PSO. The most used crossover operators are the 1-point crossover
and its generalization the n-point crossover, in which n crossover cuts are randomly selected and
the solution is generated by interchanging the segment of each parent. In uniform crossover, each
element of the solution is selected randomly from either parent, which is more flexible for mixed
vectors. For continuous variables, one can add arithmetic and geometrical crossover operators. In
PSO, the velocity updating of any particle x i is a computed function of the global best DNN дBest
and local best DNN pBesti . As for unary operators, the design of binary operators depend mainly
on the variables to be inherited:
34:18 E.-G. Talbi
Fig. 10. An example of crossover operator inherting and recombining building blocks [125]. Blue (respec-
tively, red arrows) represent common (respectively, noncommon).
• Architecture: n-ary operators have been used in different optimization frameworks:

—Crossover in EAs: In flat DNNs using linear encodings, a 1-point crossover has been
designed to recombine the layers [5, 122, 190]. Crossover operators specific to tree en-
codings of architectures have been also developed (e.g., Homologous crossover for LSTMs
[148]). In hierarchical DNNs, a uniform crossover applied at the level-1 blocks has been
used in Reference [202]. Each pair of corresponding blocks are exchanged with a given
probability. In a binary encoding of a DAG, a crossover operator preserves the common
building blocks shared between both parents by inheriting the common bits from both
parents [125]. Then, it maintains, relatively, the same complexity between the parents
and their offsprings by restricting the number of “1” bits in the offspring’s bit-string to
lie between the number of “1” bits in both parents (Figure 10). In general, all the values
at lower levels are inherited from the crossover operator involving higher levels [190].
— Velocity update in PSO: It needs to have a fixed length for all particles. Hence, new ve-
locity updates have been designed for variable-length representations. In Reference [178],
the authors used truncation and padding to deal with variable-length encodings. In Ref-
erence [189], a fixed-length bounded by the maximum length in which disabled layers
are encoded in the representation and participate to the velocity update.
• Hyperparameters: Any classical n-ary operators can be applied to mixed vectors charac-
terizing the global and operations features of a DNN:
— Crossover in EAs: Unlike n-point crossovers, the uniform crossover is well adapted to
fixed-length mixed encodings [121].
—Velocity update in PSO: Classical velocity updates are based on fixed-length continu-
ous vectors. In Reference [61], the discrete variables are relaxed to continuous variables,
such that the classical velocity update is applied. Then, a cast (i.e., rounding) operation is
carried out.
Other n-ary operators have been applied in other metaheuristics. In Reference [174] a tree
growth algorithm (TGA) has been developed for AutoCNN. The n-ary operator consists in
moving N 2 solutions yi to the distance between the closest best solutions x 1 and x 2 , by
producing linear combinations: yi = λx1 + (1 − λ)x2. For discrete variables, the obtained
values are rounded to the closest integer value.
4.5.4 Indirect Operators. The solutions of the population participate in the construction of a
shared knowledge. This shared kowledge will be the main input in sampling the new population
of DNNs. The recombination in this class of algorithm between solutions is indirect through
this shared memory. ACO and EDA are the most popular algorithms belong to this class of
Pmetaheuristics:
• Ant colony optimization (ACO): The shared knowledge is represented by the pheromone
matrix. ACO have been developed to design the LSTM cell structure of the network. LSTMs
are generated by a given number of ants, and having them choose a path through the fully
connected DNN biased by the amount of pheromone on each connection. The good quality
generated DNNs are used to update the pheromone, reinforcing the features (i.e., connec-
tion between operations) that provide good solutions [46, 52]. The same approach has been
developed for CNN [23]. For a given depth of the CNN, each ants constructs a complete
CNN by selecting the next operation by using the global pheromone.
• Estimation of distribution algorithms (EDA): The shared knowledge is represented by a
probabilistic learning model. In Reference [125], a Bayesian optimization algorithm (BOA)
has been developed to find inherent correlations between the decision variables. In Au-
toDNN, this translates to correlations in the blocks and paths across the different segments.
Exploitation uses past information across all networks evaluated to guide the final part of
the search. More specifically, if we have a network with three segments s 1 , s 2 , and s 3 , then
by using the history of generated solutions, the operator constructs a Bayesian Network
relating those variables. It consists in modeling the probability of networks beginning with
a particular segment s 1 , the probability that s 2 follows s 1 , and s 3 follows s 2 . Those estimates
are updated during the search, and new offsprings are generated by sampling from this
Bayesian Network.
Other optimization approaches use indirect operators. A Bayes Monte Carlo procedure has been
used [29]. A set of DNNs are sampled. Then, a probability distribution over high-performing DNNs
is learned.
5 SURROGATE-BASED OPTIMIZATION
Surrogate-based optimization (SBO)7 is a popular approach to deal with the AutoDNN problem.
These algorithms are iterative sampling procedures relying on surrogate models (i.e., metamodels,
approximation) of the considered objective function, which are generally characterized by an ex-
pensive computational cost [10, 161]. They iteratively determine and explore the most promising
solutions of the design space, thus simultaneously refining the surrogate model and converging
towards the problem optimum [95]. First, a set of diversified observations D n are generated using
for instance Design of Experiments (DoE) or Latin Hypercube. Using this set of observations D n ,
a surrogate s ( f ) : A −→ R of the objective function f is constructed. Then, it consists in sampling
iteratively, using the surrogate, the most promising solution x n+1 ∈ arд max qs (f ) based on an
infill sampling criterion (i.e., acquisition function) qs (f ) : A −→ R. Usually the acquisition function
uses exploiting and exploring sampling principles. The solution x n+1 is evaluated using the real
objective function yn+1 = f (x n+1 ) and is added to the set of observations D n+1 = D n ∪ (x n+1 , yn+1 ).
The surrogate is updated s ( f /D n+1 ) using the new acquisition function qs (f /Dn+1 ) , and a new so-
lution is sampled, and so on, until a given budget of evaluated solutions is finished (Figure 11).
Notice that the evaluation of the acquisition function q is much cheaper than the original function
f , which means that the optimization effort is reduced.
The various surrogate-based metaheuristics for AutoDNN can be characterized by:
• Surrogate model: There are at least two desired properties for a surrogate: correlation with
the true objective and sample efficiency. The most popular surrogate model in AutoDNN is
the Gaussian process [14, 65, 88, 100, 110, 168]. Gaussian processes are suited to continuous
optimization problems and are characterized by poor scalability to high dimensions [169,
7 Also known as Bayesian Optimization.
34:20 E.-G. Talbi
Fig. 11. The general framework of surrogate-based optimization algorithms.
170]. Then, other models have been investigated such as neural networks [167, 169, 170],
radial basis functions [28, 90], polynomial regression models [128], Tree Parzen Estimator
(TPE) [14], RNNs [28, 49], graph neural networks [128], and random forest [88]. A recent
trend consists in using multiple surrogates (i.e., ensembles of metamodels) to improve the
accuracy of the surrogates [28].
• Acquisition function: The acquisition function determines the utility of different DNN
candidates. They are based on a tradeoff between exploration by searching where predicted
variance is high, and exploitation by searching where expected value is minimized. Dif-
ferent infill criteria have been used for updating the surrogate: lower confidence bound
(LCB), upper confidence bound (UCB) [94, 168], probability of improvement (PI), expected
improvement (EI) [14, 28, 100, 128], independent Thompson sampling [198], and predictive
entropy search (PES) [78].
• Target optimization problem: Several techniques exist in SBO of continuous functions.
Hence, SBO has been widely used in solving the HPO problem. For instance, it has been
applied to tune the number of layers and the size of hidden layers in DBNs [14] and deep
feed-forward neural networks [179], the size of the filter bank, and other hyperparameters
(e.g., learning rate) in CNNs [16, 92, 133, 168, 209].
Although SBO has seen great success in the HPO problem, several issues arise when it
comes to solve the NAS problems because of the discrete variables. The advantage of us-
ing a continuous encoding (e.g., DART approach [117]) is that it allows to use traditional
and efficient surrogate models [10]. Only few methods have been developed for mixed con-
tinuous/discrete problems [145]. Indeed, using SBO for NAS requires so far specifying a
distance function between DNN architectures, to define a surrogate model (i.e., kernel func-
tion). The kernel function, which measures the similarity between network architectures,
is fundamental for selecting the architectures to evaluate during the search process [28,
100, 198]. As modern DNNs can have multiple layers, multiple branches and multiple skip
connections, comparing two DNNs is non-trivial. In Reference [28], the authors propose to
map a diverse range of discrete architectures to a continuous embedding space through the
use of RNNs and then define the kernel function based on the learned embedding space. In
Reference [100], the authors develop a distance metric in the space of CNN architectures,
which is computed via an optimal transport algorithm.
• Optimization algorithms: Acquisition functions are generally non-convex, high-
dimensional, and intractable [199]. Hence, different metaheuristics have been investigated
to optimize the acquisition function such as EAs [100, 128], gradient [28], EDA [14], CMA-
ES [14], random procedures [198], and simulated annealing [94].
6 MULTI-OBJECTIVE OPTIMIZATION
Most of the work on AutoDNN formulate the problem as a single-objective problem based on the
accuracy. However, many applications do not only require high accuracy on unseen data but also
other objectives (e.g., inference time, model size, energy consumption). A multi-objective opti-
mization problem (MOP) can be defined as [134] mina ∈A ( f 1 (a), f 2 (a), . . . , fk (a)), where k (k ≥ 2)
is the number of objectives, and A denotes the set of feasible DNNs. Contrary to single-objective
optimization, the solution of a MOP is not a single solution, but a set of solutions known as
Pareto optimal set, which is called Pareto front when it is mapped in the objective space. Any
solution of this set is optimal in the sense that no improvement can be made on one objec-
tive without worsening at least another objective. A solution a dominates a solution b if and
only if: ∀i ∈ [1..k] : fi (a) ≤ fi (b) and ∃ ∈ [1..k] : fi (a) < fi (b). The Pareto optimal solutions
are not dominated by any other solutions in the feasible space. A solution a is Pareto optimal
iff: ∀b ∈ A, ∀i ∈ [1..k], fi (a) ≤ fi (b) and f (a) f (b).
6.1 Multi-objective Single-task Learning

In classical single-task learning problems, DNNs give high accuracy at the cost of high-
computational complexity (e.g., billions of FLOPs). Recently, AutoDNN approaches have been ap-
plied to applications requiring light-weight models and fast run-time. It can be infeasible to run
real-time applications on resource constrained platforms such as IoT, smartphones, robots, drones,
autonomous vehicles, and embedded systems. Indeed, those platforms are often constrained by
hardware resources in terms of power consumption, available memory, available FLOPs, and la-
tency constraints. Optimizing those multiple objectives will enable efficient processing of DNNs
to improve energy efficiency and throughput without sacrificing application accuracy or increas-
ing hardware cost. This is a critical aspect to the wide deployment of DNNs in AI systems. Many
device-related and device-agnostic objectives have been investigated in the literature for the opti-
mization and/or the inference steps:
• Energy consumption: In using DNN models in low-power mobile and embedded areas,
there is a need to optimize the energy consumption (i.e., power) [88, 143]. Power can be es-
timated via analytical models [24, 154], simulation software [77], or measured on the target
device (i.e., hardware-aware) [88]. It depends if the platform where the DNN is designed
and the platform on which it is deployed are connected.
• Inference speed: The inference time is an important objective for real-time applications
[102]. To measure this objective, it is necessary to deploy DNNs on the target hardware
device [49, 88].
• Computational and memory cost: This cost can be estimated by the number of floating-
point operations (FLOPs) [125], and memory usage that a network performs during a for-
ward phase [49, 84, 167]. This measure can concern both training and inference [191].
• Hardware cost: The cost of the hardware on which training and/or designing are carried
out can also be taken into account [120].
• Number of parameters: Minimizing the number of parameters of DNNs has been used
in References [93, 121, 184] as a second objective for an efficient deployment of DNNs on
constrainted hardware (e.g., mobile devices [53]).
• Size of the network: It is mainly evaluated by the number of connection in the network
[63]. The connecting sparsity has been considered in designing DBNs [118] and RNNs [166].
In Reference [72], an objective consists in minimizing the number of non-zero weights. The
resulting compressed networks will have lower bandwidth requirements and require fewer
multiplications due to most weights being equal to zero.
34:22 E.-G. Talbi
• Diversity: Ensemble models using diverse DNNs tends to achieve better generalization
[22]. Diversity measures the discrepancy between the output of a DNN and the outputs of
other DNNs. An example of such diversity wich measures the total correlation between the
output of one DNN and the output of each of the other DNNs is [30]

N
M
Min Dm = i
om − Oi oij − O i ,
i=1 j=1,ij
where M is the number of DNN models, N the number of samples, om i , o i represent the
j
output of the mth and the jth DNN for the ith training sample, respectively, and O i denotes
the average output for all DNNs. In Reference [30], the Pareto DBNs networks are combined
to form an ensemble model, where combination weights are optimized via a single-objective
DE for a given learning task.
The aim of solving MOPs is to help a DNN designer to find a Pareto DNN that copes with his
preferences. One of the fundamental questions in MOPs resolution is related to the interaction be-
tween the problem solver (e.g., metaheuristic) and the designer. Indeed, the Pareto DNNs cannot
be ranked globally. The role of the designer is to specify some extra information to select his fa-
vorite solution. This interaction can take one of the three following forms: a priori [84], a posteriori
[30, 166], and interactive. To our knowledge there is no work dealing with interactive design of
DNNs, where there is a progressive interaction between the designer and the optimizer. Different
optimization approaches have been designed for multi-objective autoDNN:
• Scalarization approaches: Those approaches transform the MOP problem into a single-
objective one or a set of such problems. Among these methods one can find the aggregation
methods, weighted metrics, Tchebycheff method, goal programming methods, achievement
functions, goal attainment methods and the ϵ-constraint methods [134]. In Reference [84],
a weighted sum function α f 1 + (1 − α ) f 2 , which aggregates accuracy and energy consump-
tion has been used to solve a bi-objective optimization problem. In Reference [3], the au-
thors provide a balance between the compression ratio and the accuracy using the function
A(x )
f (x ) = C (x )(2 − C (x )) × A(r ef ) , where C (x ) is the compression ratio of the architecture x,
A(x ) is the validation performance of x and A(re f ) is the validation performance of the
#par am(x )
reference network. The compression ratio C (x ) is defined as C (x ) = 1 − #par am(r ef ) .
• Pareto approaches: Dominance-based approaches8 use the concept of dominance and
Pareto optimality to guide the search process. Population-based metaheuristics are partic-
ularly suitable to solve MOPs, because they deal simultaneously with a set of solutions that
allows us to find several Pareto DNNs in a single run of the algorithm. The main differences
between the various proposed approaches arise in the following search components: fit-
ness assignment, diversity management, and elitism [183]. Pareto EAs (e.g., NSGA-II: Non-
Sorting Genetic Algorithm) have mostly been used in the literature for designing CNNs
[102, 121, 125], RNNs [166] and LSTMs [11]. Other Pareto optimization algorithms have
also been considered such as PSO (e.g., diversity based on crowding and dominance based
on ϵ-Pareto dominance [191]), and local search [167].
• Decomposition-based approaches: Most of decomposition-based algorithms in solving
MOPs operate in the objective space. One of the well-known frameworks for MOEAs using
decomposition is MOEOA/D [210]. It uses scalarization to decompose the MOP into multiple
scalar optimization subproblems and solve them simultaneously by evolving a population
8 Also named Pareto approaches.
of DNNs. Subproblems are solved using information from the neighbouring subproblems
[129]. This approach has been developed using Tchebycheff scalarization in designing DBNs
[30, 118].
Most of the proposed MOP formulations are bi-objective. Very few many-objective models (i.e.,
more than three objectives) have been investigated. In Reference [53], a five-objective MOP has
been formulated: accuracy on data set CIFAR-10, accuracy on data set CIFAR-100, number of pa-
rameters, number of add-multiply operations and inference time. Compared to accuracy, the pro-
posed additional objectives (e.g., inference time, energy consumption) are generally cheap to eval-
uate. Hence, developing new MOP approaches that take into account this high heterogeneity in
the computational cost of the objectives is essentiel. An approach based on decoupled objective
evaluations has been proposed to enable independent evaluations across objectives [77]. In Ref-
erences [53, 88], a sequential approach is developed in handling cheap and expensive objective
functions. First, cheap objectives are used to sample new solutions. Then, in a second phase, ex-
pensive objectives participate in the search process to generate Pareto DNNs for the whole MOP.
In surrogate-based MOP, new acquisition functions have to be developed. To identify Pareto-
optimal DNNs, an acquisition function based on the hypervolume indicator has been proposed in
Reference [88]. In Reference [88], the authors consider surrogate-based MOP with heterogeneous
cost objectives. The acquisition function selects the objective across which the configuration will
be evaluated in addition to selecting the next DNN to evaluate. A trade-off is made between the
additional information obtained through an evaluation with the cost of obtaining it.
6.2 Multi-objective Multi-task Learning

Multi-task learning (MTL) allows to learn multiple different yet related tasks simultaneously. MTL
has recently been explored in a variety of DNNs solving problems in computer vision [17] and
NLP [43]. The number of parameters in a multi-task DNN would be less than in multiple DNNs
optimized for their own single task. In addition, the trained DNNs for MTL should be able to
synergize, enabling superior performance over learning each task independently using smaller
datasets per task [99]. The MTL problem is inherently multi-objective, in which the various tasks
may conflict. Hence, some trade-off models represented by Pareto solutions have to be found. In
the literature MTL is mostly solved as a single objective optimization problem via hard or soft
parameter sharing [155]. In hard parameter sharing, a subset of parameters is shared between
tasks, while other parameters are task specific. In soft parameter sharing, all parameters are task
specific, but they are jointly constrained via a Bayesian prior or a joint dictionary. In the design of
a global model that shares some parameters across tasks, the parameters can be learned by solving
a MOP that takes into account all uncertainties on the defined tasks.
Very few works in the literature investigate a multi-objective approach to solve the MTL prob-
lem. Various objectives (i.e., loss functions for different tasks) can be handled in the formulation
of the problem: Min Θ (L1 (a, θ c , θ s1 ), . . . , Lt (a, θ c , θ st ), . . . , LT (a, θ c , θ sT )), t = 1, 2, &,T , where Lt is
the loss function function of task t, T is the total number of tasks, θ c are the shared parameters,
and θ st are the specific task parameters. Hence, Pareto solutions representing potential optimal
architectures will be generated to solve the MTL problem. To our knowledge only scalarization
approaches have been proposed. In References [111, 159], a weighted linear aggregation of the
per-task losses has been applied and solved using gradient-based algorithms. In Reference [66],
many weighting approaches have been evaluated including uniform combination of losses, dy-
namic weight average (DWA) [119] and uncertainty weighting methods [101] with various sizes
of datasets per-task.
34:24 E.-G. Talbi
Fig. 12. Parallel models of metaheuristics for AutoDNN.
7 PARALLEL OPTIMIZATION
On the one hand, AutoDNN problems are more and more complex (e.g., dataset and network size)
and their resource requirements in terms of computation and memory are ever increasing. Al-
though the use of metaheuristics allows to significantly reduce the computational complexity of
the search process, it remains time-consuming. On the other hand, the rapid development of tech-
nology in hardware design (e.g., GPU, TPU) makes the use of parallel computing increasingly pop-
ular. State-of-the-art DNNs required 3,150 and 2,000 GPU days [152, 219]. Parallel optimization can
be used for the following reasons: speedup the search, improve the quality of DNNs, reduce the
energy, improve the robustness, and solve large-scale and/or complex learning tasks. In this arti-
cle, we make a clear distinction between the parallel design aspect and the parallel implementation
aspect.
7.1 Parallel Design

In terms of designing parallel metaheuristics for AutoDNN, three major parallel hierarchical mod-
els are identified (Figure 12):
• Algorithm-level: In this parallel model, independent, cooperating, or competitive self-
contained metaheuristics are used. If the different metaheuristics are independent, then
the search will be equivalent to the sequential execution of the metaheuristics. However,
cooperative and competitive models can alter the behavior of the metaheuristics and enable
the improvement of the quality of DNNs by providing better convergence and diversity.
Very few algorithm-level parallel models have been investigated in the literature. A parallel
independent approach has been investigated for transfer learning [59]. A set of parallel
asynchronous agents learn how to reuse the architecture of an existing DNN [156] for new
learning tasks. An agent based on an EA is used to evolve a subpart (i.e., PathNets) of a giant
DNN. PathNet may be thought of as a form of “evolutionary dropout” in which instead of
randomly dropping out nodes and their connections, “thinned” DNNs are evolved in parallel
for various learning tasks.
Some parallel cooperative models has been developed for neural network design [47,
135]. The algorithm is based on the popular “Island Parallel EA” in which a set of parallel
EAs are cooperating to solve the problem [182]. In Reference [135], two populations of DNN
cells and topologies are evolving in parallel. During evaluation, the cells are combined into
topologies to create a larger assembled DNNs. An example of algorithm-level competitive
parallel model can be found in designing generative neural networks (GANs). GANs
are composed of two adversarial DNNs: a generator and a descriminator [68]. The two
networks are confronted in a zero-sum game. The generator creates fake noisy input data
to deceive the discriminator, while the discriminator learns to distinguish between real and
fake samples. In contrast to conventional GANs, which alternate the update of the generator
and a discriminator, some algorithm-level parallel EA models have been proposed [39, 193].
In Reference [39], a co-evolutionary approach has been used, in which the discriminator
and generator population of networks are trained simultaneously as adversaries. Two pop-
ulations of generators and discriminators evolve in parallel following its own optimization
process. The discriminator D (respectively, generator G) networks optimize the following
loss function: L D (D, G) = −Ex dat a [loдD (x )] − Ez noisy [loд(1 − D(G (z)))] (respectively)
−Ez noisy [loд(D (G (z)))], where data represents the input dataset, z (respectively, noisy)
represents the noisy data (respectively, noise distribution).
• Iteration-level: In this model, an iteration of a metaheuristic is parallelized. The behavior
of the metaheuristic is not altered. The main goal is to speedup the algorithm by reduc-
ing the search time. Indeed, the iteration cycle of metaheuristics requires a large amount
of computational resources for training. The most popular iteration-level parallel model
consists in evaluating in parallel the generated DNNs. Neighborhood decomposition in lo-
cal search-based metaheuristics consists in partitioning the neighborhood at each itera-
tion. Then, each partition of DNNs can be evaluated in parallel. Population decomposition
in population-based metaheuristics (e.g., EAs, ACOs, PSOs) consists in generating sub-
populations of DNNs and evaluates them in parallel. In the synchronous mode, a master
manages the search process. At each iteration, the master distributes the set of new gen-
erated DNNs among the workers and waits for the results of all DNNs (e.g., EAs [130, 152,
208], (1 + λ)ES [131], PSO [123, 191], multi-armed bandits [57]). While the results are col-
lected, the search process is iterated. In the asynchronous mode, the evaluation phase is not
synchronized with the other parts of the search process in EAs [116] and ACO [52]. The
master does not wait for the return back of all DNNs evaluations to start the next iteration.
The steady-state EA is a good example illustrating the asynchronous model [116].
• Solution-level: In this model, the parallelization process handles the training of a single
DNN, which is the most costly operation [12]. Training broadly comprises iterations over
two dataflows steps: the forward step for training the sample data, and the backward step
for updating weights (e.g., computing gradients). Four solution-level parallel models may
be carried out for training:
—Data-based decomposition: The same DNN model is duplicated among different work-
ers with different portions of the training data [45]. The computations are carried out in
parallel on different data partitions. In Reference [147], each worker stores an identical
copy of the model and computes gradients only on a partition of the training examples,
and these gradients are aggregated to update the model.
— Function-based decomposition: The DNN model is partitioned into different sub-
functions. Each sub-function is evaluated in parallel using the same training data. Then,
a reduction is performed on the results returned back by the computed sub-functions. By
definition, this model is synchronous, so one has to wait for the termination of all work-
ers calculating the operations. Three different levels of function decomposition can be
applied: (1) Objective level in which different objective functions are evaluated in parallel
34:26 E.-G. Talbi
such as in multi-objective AutoDNN. (2) Model level in which different sub-models (e.g.,
operations) are handled in parallel. For example, in Reference [104], different workers
train different parts of the model. A convolution with k filters can be splitted in n opera-
tions, each of which convolves its input with nk filters. (3) Operation level in which a given
operation (e.g., convolution) is handled in parallel. For instance, a FC layer can be modeled
as matrix-matrix multiplication and is well suited to be performed in parallel [12].
— Pipeline decomposition: It consists in designing a pipeline of the layers composing
a DNN, where one or more consecutive layers of a DNN form a chunk. The layers in
a chunk are executed on one worker, and thus, different workers compute the DNN in
a pipelined manner [74, 86]. This parallel computing model is efficient for large DNNs
and/or large datasets.
— Combined decomposition: The previous strategies can be combined. For instance, the
function, data parallel and pipeline models can be used jointly. A combined parallelisation
mixing functional and data parallelism has been proposed in References [104, 205]. In
Reference [104] the authors use data parallelism in the convolutional layers (i.e., compute
intensive) and function parallelism in the FC layers (i.e., memory-intensive). Very few
papers combine pipelining, function parallelism, and data parallelism [74].
7.2 Parallel Implementation

Parallel implementation of AutoDNN metaheuristics deals with the efficient mapping of a parallel
model of metaheuristics on a given parallel architecture. Computational throughput, power con-
sumption and memory efficiency are three important indicators in parallel architectures. Parallel
architectures are evolving quickly and are dominated by two types of architectures: shared memory
architectures (e.g., multi-core CPU, accelerators such as GPU) and distributed memory architectures
(e.g., clusters of CPUs).
Shared-memory architectures: Accelerators and multi-core CPUs represent the most pop-
ular shared-memory architectures. Accelerators are often connected with a server through PCIe
bus. They can be classified as temporal or spatial architectures. Popular temporal architectures are
multi-cores and GPUs. They use Single Instruction Multiple Threads (SIMT) and Single Instruction
Multiple Data (SIMD) as parallel computing models. They use a centralized control for a large num-
ber of ALUs, which can only fetch data from the memory hierarchy. Due to their high throughput
support and an architecture designed specifically for data parallel workflows, GPUs (e.g., Nvidia,
AMD series) are well adapted for DNN computational requirements. Using CUDA API or OpenCL,
these frameworks boost their scale-up efficiency using threads to utilize multiple GPUs in a single
node (single address space). There are various heterogeneous architectures integrating CPU with
GPUs at chip level such as AMD APUs [9, 71]. However, conventional CPUs and GPUs are energy-
inefficient due to their effort for flexibility, and then they are not preferred for power constrained
applications [58]. Spatial architectures use dataflow processing, where a processing sequence is
composed of ALUs transfer data from one to another. FPGA and ASICS (Application-Specific inte-
grated Circuits) are the most widely used spatial architectures. FPGA allows to implement irreg-
ular parallelism, customized data type and application-specific hardware, offering great flexibility
to accommodate new DNN models. However, their drawbacks are the on-chip memory limitation,
and the lack of efficient high-level APIs. A major improvement in cost-energy performance comes
from domain-specific hardware, such as Tensor Processing Units (TPUs). They are AI-dedicated
ASIC, which targets a high volume of low-precision (e.g., 8-bit) arithmetic while maintaining low
power consumption. However, they have less flexibility and longer development cycle than FPGAs
[192].
Distributed-memory architectures: Clusters of CPU nodes represent the most popular

distributed-memory architecture. The computer nodes are connected by high speed networks such
as modern Ethernet and InfiniBand. The most important metrics for the interconnection network
are latency and bandwidth. Message Passing Interface (MPI) is the omnipresent programming
model for distributed memory architectures. The performance of single-node multi-GPU is near-
ing saturation for large datasets and DNN models. Thus, scale-out efficiency with large clusters
of heterogeneous nodes (e.g., CPU-GPU) is an emerging topic. Most of the top high-performance
computing (HPC) systems9 are composed of clusters of heterogeneous nodes (CPU and GPU) mix-
ing shared-memory and distributed-memory models. Hence, traditional HPC applications have
been successfully redesigned to scale-out using a hybrid programming model mixing MPI and
CUDA.
Parallel implementation of AutoDNN metaheuristics on parallel hardware has to be consider
maximizing accuracy and throughput, while minimizing energy and cost. The throughput of a par-
allel metaheuristic on a given parallel architecture depends mainly on its granularity. It computes
the ratio between the computation cost and the communication cost. The three parallel models
have a decreasing granularity from large-grained to fine-grained:
• Algorithm-level: This model has the largest granularity. There is a relatively low commu-
nication requirements. It is the most suited parallel model to conventional parallel architec-
tures such as HPC systems and clusters of multi-cores. In terms of scalability, its degree of
concurrency is limited by the number of metaheuristics involved in solving the problem. In
Reference [59], an implementation on a cluster of CPUs using 64 asychronous independent
algorithms has been carried out.
• Iteration-level: A medium granularity is associated to the iteration-level parallel model.
As the objective function is very expensive, this model has been widely implemented on
multi-GPUs and clusters of multi-cores. It has been deployed efficiently for EAs [130, 191,
208], PSO [123] on GPUs powered clusters using MPI-CUDA, and ACO on clusters of multi-
core CPUs using MPI [52]. The degree of concurrency of this model is limited by the size of
the neighborhood for S-metaheuristics or the size of the population for P-metaheuristics.
The use of very large neighborhoods and large populations will increase the scalability of
this parallel model. Introducing asynchronous communications in the model will increase
the efficiency of parallel metaheuristics [130].
• Solution-level: This model has the finer granularity and is therefore adapted for accel-
erators such as GPUs, FPGAs and TPUs [198]. The degree of concurrency of this parallel
model is limited by the number of objective functions, data partitions and layers of DNNs.
Most of the existing parallel implementations have been carried out on single GPUs. Few
parallel models have been implemented on specific hardware, such as FPGAs [136] (e.g.,
CNNs [73, 192], LSTMs [212]), and Arm processors [120]. The size of DNNs raises some
problems according to the GPU memory. The system would crash because of a shortage of
GPU memory. Many approaches have been proposed to find memory-efficient DNNs [184].
Some parallel implementation have been developed on a single server with multiple GPUs
having disjoint memory spaces [141, 205]. The single-server implementation scales only to
8 GPUs before the host server becomes overburdened by I/O, power, cooling, and CPU com-
pute demands. Multiple servers where each server represents a cluster of multi-core and/or
multiple GPU represent a more scalable implementation [37]. As communication is the ma-
jor bottelneck in large clusters of GPUs, many techniques have been proposed to overlap
9 Top500.
34:28 E.-G. Talbi
communication and computation [51, 89, 160]. Many solution-based parallel models have
been investigated:
— Data-based decomposition: Each node (e.g., GPU) trains on its own data partition while
synchronizing weights with other nodes, using either collective communication primi-
tives [69] or sharing memory with servers [41]. Data-based decomposition requires syn-
chronous communication between nodes, since each node must communicate both gra-
dients and parameter values on every update step [37]. Moreover, the mini-batch size gets
multiplied by the number of used nodes.
— Function-based decomposition: It has been implemented on large clusters of CPUs
[45], and HPC clusters of heterogenoeus nodes (i.e., multi-core CPU/GPU) using CUDA
and MPI [139]. The operation level is always handled by single node accelerators. For
instance, convolution in CNN and gate systems in RNNs (i.e., matrix-matrix multiplica-
tion) are generally implemented on fine grained architectures such as vector accelerators
of CPUs or many-core architectures (e.g., GPU) [12]. The model level is generally imple-
mented on clusters of GPUs and/or CPUs [12, 37]. The objective level is generally imple-
mented on heterogenous architectures. For multi-objective AutoDNN, one can decouple
the evaluation of heterogeneous objectives on different hardware platforms. For instance,
one can evaluate the accuracy on non-dedicated hardware and energy consumption on
specific hardware [77].
— Pipeline decomposition: limited network bandwidth hardware induces high
communication-to-computation ratios. Pipelining different micro-batches on sub-
functions of layers allows to benefit memory utilization and thus make fitting giant
models feasible. GPipe provides the flexibility of scaling a variety of different networks
to gigantic sizes efficiently, and has been implemented on a single server with TPUv3s
and NVIDIA P100 GPU [86]. Pipelining can also be applied between training different
DNNs where the optimizer generates the next DNN to be trained and starts the training
on GPU. Then, instead of waiting for the training to finish, it starts to generate the next
DNN [94]. The idle time of nodes (e.g., GPU, CPU) is then reduced.
DNN librairies (e.g., cuDNN, Cuda-convnet) and frameworks (e.g., Tensorflow, Caffe, Torch,
Thenao) have been developed to facilitate parallel implementation. Most DNN frameworks
are limited to a single node (e.g., GPU) and have not been designed to be efficient of large
clusters of heterogeneous nodes using MPI and CUDA [6]. TensorFlow maps the nodes of
a dataflow graph across many machines in a cluster, and within a machine across multi-
ple computational devices, including multicore CPUs, general purpose GPUs, and custom-
designed ASICs such as Tensor Processing Units (TPUs) and ARM-based platforms [1].
8 CONCLUSIONS AND PERSPECTIVES

In this article, a survey and taxonomy for the automated design of DNNs has been presented. A
unified way to describe the optimization algorithms allowed to focus on common and important
search components for all AutoDNN approaches. We have also extended this unifying view to
important optimization methodologies dealing with surrogate-based, multi-objective, and parallel
optimization. Most of the proposed AutoDNN approaches have been applied to image classifica-
tion. The proposed survey and taxonomy can help to extend the proposed taxonomy to other less
explored applications in computer vision (e.g., image restoration, semantic segmentation), NLP
(e.g., langage translation) and Industry 4.0 (e.g., predictive maintenance). It can also be reused for
other types of deep-learning architectures such as spiking neural networks.
An important issue is the definition of efficient and effective encodings, objective function(s)
and constraints. From a landscape analysis using measures such as FDC (i.e., fitness-distance
correlation) [96], and autocorrelation (i.e., autocorrelation of the accuracies of visited DNNs
in a random walk) [171], we can extract some knowledge for designing and understanding
the behavior of optimization algorithms. Designing multi-fidelity surrogates for variable space
mixed optimization problems represents an important research issue. The AutoDNN problem
is intrinsically multi-objective. To our knowledge there is no work dealing with interactive
multi-objective design of DNNs, in which there is a progressive interaction between the designer
and the optimizer. Indeed, one can use his knowledge in helping the optimizer to converge
towards interesting design subspaces [204].
HPC is evolving toward Exascale supercomputers composed of millions of cores provided in
heterogeneous devices mainly multi-core processors with various architectures. To our knowl-
edge there is no work using in conjunction the three hierarchical parallel models (i.e., algorithm,
iteration and solutin levels) introduced in this article. The massively parallel implementation of
the three hierarchical parallel models on Exascale supercomputers is an interesting challenge to
enhance efficiency and effectiveness [183]. Highly energy-efficient hardware accelerators are re-
quired for a broad spectrum of challenging AI applications [206]. Future works also need to assess
the performance benefits according to the energy overheads from the design of efficient DNN
algorithms to the design of efficient DNN processors [34].
The coupling of software frameworks dealing with optimization and deep learning is an im-
portant issue for the future. This enables us to reduce the complexity of developing optimization
approaches for new AutoDNN problems and makes them increasingly popular. Finally, some ef-
forts must be done in the definition of performance evaluation methodologies for the comparison
of different AutoDNN methodologies. Particularly, we notice the lack of information needed to
exactly reproduce the published results.
REFERENCES
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, and M. Isard. 2016.
Tensorflow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium. 265–283.
[2] K. Ahmed and L. Torresani. 2018. Maskconnect: Connectivity learning by gradient descent. In Proceedings of the
European Conference on Computer Vision (ECCV’18). 349–365.
[3] A. Ashok, N. Rhinehart, F. Beainy, and K. M. Kitani. 2018. N2N learning: Network to network compression via
policy gradient reinforcement learning. In Proceedings of the 6th International Conference on Learning Representations
(ICLR’18).
[4] F. Assunção, N. Lourenço, P. Machado, and B. Ribeiro. 2018. Evolving the topology of large-scale deep neural net-
works. In Proceedings of the 21st European Conference on Genetic Programming (EuroGP’18), Vol. 10781. 19–34.
[5] F. Assunçao, N. Lourenço, P. Machado, and B. Ribeiro. 2019. DENSER: Deep evolutionary network structured rep-
resentation. Genet. Program. Evolv. Mach. 20, 1 (2019), 5–35.
[6] A. Awan, K. Hamidouche, J. Hashmi, and D. Panda. 2017. S-caffe: Co-designing MPI runtimes and caffe for scalable
deep learning on modern GPU clusters. In Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming. 193–205.
[7] B. Baker, O. Gupta, N. Naik, and R. Raskar. 2017. Designing neural network architectures using reinforcement learn-
ing. In Proceedings of the 5th International Conference on Learning Representations (ICLR’17).
[8] B. Baker, O. Gupta, R. Raskar, and N. Naik. 2018. Accelerating neural architecture search using performance predic-
tion. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18).
[9] Z. Bao, Q. Luo, and W. Zhang. 2017. An implementation and improvement of convolutional neural networks on HSA
platform. In Proceedings of the International Conference of Pioneering Computer Scientists, Engineers and Educators.
594–604.
[10] T. Bartz-Beielstein, B. Filipic, P. Korosec, and E.-G. Talbi (Eds.). 2020. High-performance Simulation-based Optimiza-
tion. Studies in Computational Intelligence, Vol. 833. Springer.
[11] J. Bayer, D. Wierstra, J. Togelius, and J. Schmidhuber. 2009. Evolving memory cell structures for sequence learning.
In Proceedings of the 19th International Conference Artificial Neural Networks (ICANN’09), Vol. 5769. 755–764.
[12] T. Ben-Nun and T. Hoefler. 2019. Demystifying parallel and distributed deep learning: An in-depth concurrency
analysis. ACM Comput. Surv. 52, 4 (2019), 65:1–65:43.
34:30 E.-G. Talbi
[13] G. Bender, P-J Kindermans, W. Zoph, V. Vasudevan, and Q. V. Le. 2018. Understanding and simplifying one-shot
architecture search. In Proceedings of the 35th International Conference on Machine Learning (ICML’18). 549–558.
[14] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. 2011. Algorithms for hyper-parameter optimization. In Proceedings
of the 25th Annual Conference on Neural Information Processing Systems. 2546–2554.
[15] James Bergstra and Yoshua Bengio. 2012. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 13
(Feb. 2012), 281–305.
[16] J. Bergstra, D. Yamins, and D. D. Cox. 2013. Making a science of model search: Hyperparameter optimization in
hundreds of dimensions for vision architectures. In International Conference on Machine Learning PMLR. 115–123.
[17] H. Bilen and A. Vedaldi. 2017. Universal representations: The missing link between faces, text, planktons, and cat
breeds. CoRR abs/1701.07275.
[18] B. Bischl, O. Mersmann, H. Trautmann, and C. Weihs. 2012. Resampling methods for meta-model validation with
recommendations for evolutionary computation. Evolution. Comput. 20, 2 (2012), 249–275.
[19] D. Blalock, J. G. Ortiz, J. Frankle, and J. Guttag. 2020. What is the state of neural network pruning? Retrieved from
https://Arxiv:2003.03033 (2020).
[20] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. 2015. Weight uncertainty in neural network. In Proceed-
ings of the 32nd International Conference on Machine Learning (ICML’15), Vol. 37. 1613–1622.
[21] A. Brock, T. Lim, J. M. Ritchie, and N. Weston. 2018. Smash: One-shot model architecture search through hypernet-
works. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18).
[22] G. Brown, J. Wyatt, R. Harris, and X. Yao. 2005. Diversity creation methods: A survey and categorisation. Info. Fusion
6, 1 (2005), 5–20.
[23] E. Byla and W. Pang. 2019. DeepSwarm: Optimising convolutional neural networks using swarm intelligence. In
Advances in Computational Intelligence Systems. 119–130.
[24] E. Cai, D-C. Juan, D. Stamoulis, and D. Marculescu. 2017. Neuralpower: Predict and deploy energy-efficient convo-
lutional neural networks. Retrieved from https://Arxiv:1710.05420.
[25] H. Cai, T. Chen, W. Zhang, Y. Yu, and J. Wang. 2018. Efficient architecture search by network transformation. In
Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI’18). 2787–2794.
[26] H. Cai, J. Yang, W. Zhang, S. Han, and Y. Yu. 2018. Path-level network transformation for efficient architecture search.
In Proceedings of the International Conference on Machine Learning (ICML’18). 677–686.
[27] A. Camero, H. Wang, E. Alba, and T. Bäck. 2020. Bayesian neural architecture search using a training-free perfor-
mance metric. Retrieved from https://Arxiv:2001.10726.
[28] S. Cao, X. Wang, and K. M. Kitani. 2019. Learnable embedding space for efficient neural architecture compression.
In Proceedings of the 7th International Conference on Learning Representations (ICLR’19).
[29] F. P. Paolo Casale, J. Gordon, and N. Fusi. 2019. Probabilistic neural architecture search. CoRR abs/1902.05116.
[30] A. Chandra and Xin Yao. 2006. Ensemble learning using multi-objective evolutionary algorithms. J. Math. Model.
Algor. 5, 4 (2006), 417–445.
[31] C. Chen, F. Tung, N. Vedula, and G. Mori. 2018. Constraint-aware deep neural network compression. In Proceedings
of the European Conference on Computer Vision (ECCV’18). 400–415.
[32] L-C. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens. 2018. Searching for
efficient multi-scale architectures for dense image prediction. In Proceedings of the Conference on Neural Information
Processing Systems (NIPS’18). 8699–8710.
[33] T. Chen, I. J. Goodfellow, and J. Shlens. 2016. Net2Net: Accelerating learning via knowledge transfer. In Proceedings
of the 4th International Conference on Learning Representations (ICLR’16).
[34] Y. Chen, T-J. Yang, J. Emer, and V. Sze. 2018. Understanding the limitations of existing energy-efficient design ap-
proaches for deep neural networks. Energy 2, 1 (2018).
[35] Y. Chen, K. Zhu, L. Zhu, X. He, P. Ghamisi, and J. A. Benediktsson. 2019. Automatic design of convolutional neural
network for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 57, 9 (2019), 7048–7066.
[36] P. Chrabaszcz, I. Loshchilov, and F. Hutter. 2017. A downsampled variant of imagenet as an alternative to the cifar
datasets. Retrieved from https://Arxiv:1707.08819.
[37] A. Coates, B. Huval, T. Wang, D. J. Wu, B. Catanzaro, and A. Y. Ng. 2013. Deep learning with COTS HPC systems.
In Proceedings of the 30th International Conference on Machine Learning (ICML’13). 1337–1345.
[38] M. D. Collins and M. Kohli. 2014. Memory bounded deep convolutional networks. CoRR abs/1412.1442.
[39] V. Costa, N. Lourenço, and P. Machado. [n.d.]. Coevolution of generative adversarial networks. In Evoapplications
International Conference on Applications of Evolutionary Computation, Vol. 11454. 473–487.
[40] M. Courbariaux, Y. Bengio, and J-P. David. 2015. BinaryConnect: Training deep neural networks with binary weights
during propagations. In Advances in Neural Information Processing Systems. MIT Press, 3123–3131.
[41] H. Cui, H. Zhang, G. R. Ganger, P. B. Gibbons, and E. P. Xing. 2016. GeePS: Scalable deep learning on distributed
GPUs with a GPU-specialized parameter server. In Proceedings of the European Conference on Computer Systems
(EuroSys’16). 1–16.
[42] A. Darwish, A. E. Hassanien, and S. Das. 2020. A survey of swarm and evolutionary computing approaches for deep
learning. Artific. Intell. Rev. 53, 3 (2020), 1767–1812.
[43] A. Das, M. Hasegawa-Johnson, and K. Veselý. 2017. Deep auto-encoder based multi-task learning using probabilistic
transcriptions. In Proceedings of the 18th Conference of the International Speech Communication Association. 2073–
2077.
[44] D. Dasgupta and Z. Michalewicz. 2013. Evolutionary Algorithms in Engineering Applications. Springer.
[45] J. Dean et al. 2012. Large-scale distributed deep networks. In Advances in Neural Information Processing Systems.
MIT Press, 1223–1231.
[46] T. Desell, S. Clachar, J. Higgins, and B. Wild. 2015. Evolving deep recurrent neural networks using ant colony op-
timization. In Proceedings of the 15th European Conference Evolutionary Computation in Combinatorial Optimization
(EvoCOP’15). 86–98.
[47] T. Dokeroglu and E. Sevinc. 2019. Evolutionary parallel extreme learning machines for the data classification prob-
lem. Comput. Industr. Eng. 130 (2019), 237–249.
[48] T. Domhan, J. T. Springenberg, and F. Hutter. 2015. Speeding up automatic hyperparameter optimization of deep
neural networks by extrapolation of learning curves. In Proceedings of the International Joint Conference on Artificial
Intelligence.
[49] J.-D. Dong, A.-C. Cheng, D.-H. Juan, W. Wei, and M. Sun. 2018. DPP-Net: Device-aware progressive search for
Pareto-optimal neural architectures. In Proceedings of the 15th European Conference on Computer Vision (ECCV’18).
540–555.
[50] M. Dorigo, M. Birattari, and T. Stutzle. 2006. Ant colony optimization. IEEE Comput. Intell. Mag. 1, 4 (2006), 28–39.
[51] N. Dryden, N. Maruyama, T. Moon, T. Benson, A. Yoo, M. Snir, and B. Van Essen. 2018. Aluminum: An Asynchronous,
GPU-aware Communication Library Optimized for Large-scale Training of Deep Neural Networks on HPC Systems.
Technical Report, Lawrence Livermore National Laboratory, Livermore, CA.
[52] A. ElSaid, F. El Jamiy, J. Higgins, B. Wild, and T. Desell. 2018. Optimizing long short-term memory recurrent neural
networks using ant colony optimization to predict turbine engine vibration. Appl. Soft Comput. 73 (2018), 969–991.
[53] T. Elsken, J. Hendrik, and F. Hutter. 2018. Efficient multi-objective neural architecture search via lamarckian evolu-
tion. Retrieved from https://Arxiv:1804.09081.
[54] T. Elsken, J-H. Metzen, and F. Hutter. 2018. Simple and efficient architecture search for convolutional neural net-
works. In Proceedings of the 6th International Conference on Learning Representations (ICLR’18).
[55] T. Elsken, J. H. Metzen, and F. Hutter. 2019. Neural architecture search: A survey. J. Mach. Learn. Res. 20 (2019),
55:1–55:21.
[56] R. S. Engelmore and A. Morgan. 1988. Blackboard Systems. Addison-Wesley.
[57] S. Falkner, A. Klein, and F. Hutter. 2018. BOHB: Robust and efficient hyperparameter optimization at scale. In Pro-
ceedings of the 35th International Conference on Machine Learning (ICML’18). 1436–1445.
[58] Xin Feng, Youni Jiang, Xuejiao Yang, Ming Du, and Xin Li. 2019. Computer vision algorithms and hardware imple-
mentations: A survey. Integr. VLSI J. 69 (2019), 309–320.
[59] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. Rusu, A. Pritzel, and D. Wierstra. 2017. Pathnet: Evolution
channels gradient descent in super neural networks. Retrieved from https://Arxiv:1701.08734.
[60] M. Feurer and F. Hutter. 2019. Hyperparameter optimization. In Automated Machine Learning. Springer, 3–33.
[61] Ben Fielding and Li Zhang. 2018. Evolving image classification architectures with enhanced particle swarm optimi-
sation. IEEE Access 6 (2018), 68560–68575.
[62] S. Fong, S. Deb, and X-S. Yang. 2018. How meta-heuristic algorithms contribute to deep learning in the hype of big
data analytics. In Progress in Intelligent Computing Techniques. Springer, 3–25.
[63] A. Gaier and D. Ha. 2019. Weight agnostic neural networks. In Proceedings of the Conference on Advances in Neural
Information Processing Systems (NeurIPS’19). 5365–5379.
[64] X. Glorot and Y. Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Pro-
ceedings of the 13th International Conference on Artificial Intelligence and Statistics. 249–256.
[65] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley. 2017. Google vizier: A service for black-box
optimization. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
1487–1495.
[66] T. Gong, T. Lee, C. Stephenson, V. Renduchintala, S. Padhy, A. Ndirango, G. Keskin, and O. Elibol. 2019. A comparison
of loss weighting strategies for multi task learning in deep neural networks. IEEE Access 7 (2019), 141627–141632.
[67] I. Goodfellow, Y. Bengio, and A. Courville. 2016. Deep Learning. MIT Press.
[68] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. 2014.
Generative adversarial nets. In Proceedings of the Conference on Advances in Neural Information Processing Systems
(NIPS’14). 2672–2680.
34:32 E.-G. Talbi
[69] P. Goyal and other. 2017. Accurate, large minibatch SGD: Training imagenet in 1 hour. Retrieved from https://Arxiv:
1706.02677.
[70] F. Gruau. 1993. Genetic synthesis of modular neural networks. In Proceedings of the 5th International Conference on
Genetic Algorithms. 318–325.
[71] J. Gu, M. Zhu, Z. Zhou, F. Zhang, Z. Lin, Q. Zhang, and M. Breternitz. 2014. Implementation and evaluation of deep
neural networks (DNN) on mainstream heterogeneous systems. In Proceedings of the 5th Asia-Pacific Workshop on
Systems. 1–7.
[72] S. Han, J. Pool, J. Tran, and W. J. Dally. 2015. Learning both weights and connections for efficient neural network.
In Proceedings of the Conference on Advances in Neural Information Processing Systems. 1135–1143.
[73] X. Han, D. Zhou, S. Wang, and S. Kimura. 2016. CNN-MERP: An FPGA-based memory-efficient reconfigurable pro-
cessor for forward and backward propagation of convolutional neural networks. In Proceedings of the IEEE Interna-
tional Conference on Computer Design (ICCD’16). 320–327.
[74] A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger, and P. Gibbons. 2018. Pipedream:
Fast and efficient pipeline parallel dnn training. Retrieved from https://Arxiv:1806.03377.
[75] K. He and J. Sun. 2015. Convolutional neural networks at constrained time cost. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition (CVPR’15).
[76] K. He, X. Zhang, S. Ren, and J. Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 770–778.
[77] J. M. Hernández-Lobato, M. A. Gelbart, B. Reagen, R. Adolf, D. Hernández-Lobato, P. N. Whatmough, D. Brooks,
G-Y. Wei, and R. P. Adams. 2016. Designing neural network hardware accelerators with decoupled objective evalu-
ations. In Proceedings of the NIPS Workshop on Bayesian Optimization.
[78] J. M. Hernández-Lobato, M. W. Hoffman, and Z. Ghahramani. 2014. Predictive entropy search for efficient global
optimization of black-box functions. In Advances in Neural Information Processing Systems. MIT Press, 918–926.
[79] G. Hinton, S. Osindero, and Y.-W. Teh. 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18, 7
(2006), 1527–1554.
[80] G. E. Hinton. 2012. A practical guide to training restricted boltzmann machines. In Neural Networks: Tricks of the
Trade, G. Montavon, G. B. Orr, and K-R Müller (Eds.). Vol. 7700. 599–619.
[81] G. E. Hinton, S. Osindero, and Y. W. Teh. 2006. A fast learning algorithm for deep belief nets. Neural Comput. 18, 7
(2006), 1527–1554.
[82] T. Hinz, N. Navarro-Guerrero, S. Magg, and S. Wermter. 2018. Speeding up the hyperparameter optimization of deep
convolutional neural networks. Int. J. Comput. Intell. Appl. 17, 02 (2018).
[83] S. Hochreiter and J. Schmidhuber. 1997. Long short-term memory. Neural Comput. 9, 8 (1997), 1735–1780.
[84] C. H. Hsu, S. H. Chang, J. H. Liang, H. P. Chou, C. H. Liu, S. C. Chang, J. Y. Pan, Y. T. Chen, W. Wei, and D. C. Juan.
2018. Monas: Multi-objective neural architecture search using reinforcement learning. Retrieved from https://Arxiv:
1806.10332.
[85] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. 2017. Densely connected convolutional networks. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[86] Y. Huang, Y. Cheng, A. Bapna, O. Firat, D. Chen, M. Chen, H. Lee, J. Ngiam, Q. V. Le, and Y. Wu. 2019. Gpipe: Efficient
training of giant neural networks using pipeline parallelism. In Proceedings of the Conference on Neural Information
Processing Systems (NIPS’19). 103–112.
[87] F. Hutter, L. Kotthoff, and J. Vanschoren. 2019. Automated Machine Learning. Springer.
[88] Md I. M. Shahriar, J. Su, L. Kotthoff, and P. Jamshidi. 2020. FlexiBO: Cost-aware multi-objective optimization of deep
neural networks. arXiv preprint arXiv:2001.06588.
[89] F. Iandola, M. Moskewicz, K. Ashraf, and K. Keutzer. 2016. Firecaffe: Near-linear acceleration of deep neural network
training on compute clusters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2592–
2600.
[90] I. Ilievski, T. Akhtar, J. Feng, and C. A. Shoemaker. 2017. Efficient hyperparameter optimization for deep learning
algorithms using deterministic RBF surrogates. In Proceedings of the AAAI Conference on Artificial Intelligence.
[91] Y. Jaafra, J.-L. Laurent, A. Deruyver, and M. S. Naceur. 2019. Reinforcement learning for neural architecture search:
A review. Image Vision Comput. 89 (2019), 57–66.
[92] R. Jenatton, C. Archambeau, J. González, and M. W. Seeger. 2017. Bayesian optimization with tree-structured de-
pendencies. In Proceedings of the 34th International Conference on Machine Learning (ICML’17). 1655–1664.
[93] J. Jiang, F. Han, Q. Ling, J. Wang, T. Li, and H. Han. 2020. Efficient network architecture search via multiobjective
particle swarm optimization based on decomposition. Neural Netw. 123 (2020), 305–316.
[94] H. Jin, Q. Song, and X. Hu. 2019. Auto-keras: An efficient neural architecture search system. In Proceedings of the
25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1946–1956.
[95] Y. Jin. 2011. Surrogate-assisted evolutionary computation: Recent advances and future challenges. Swarm Evolution.
Comput. 1 (06 2011), 61–70.
[96] Terry Jones et al. 1995. Evolutionary Algorithms, Fitness Landscapes and Search. Ph.D. Dissertation. Citeseer.
[97] R. Józefowicz, W. Zaremba, and I. Sutskever. 2015. An empirical exploration of recurrent network architectures. In
Proceedings of the 32nd International Conference on Machine Learning (ICML’15). 2342–2350.
[98] F. E. Junior and G. Yen. 2019. Particle swarm optimization of deep neural networks architectures for image classifi-
cation. Swarm Evolution. Comput. 49 (2019), 62–74.
[99] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit. 2017. One model to learn them
all. CoRR abs/1706.05137.
[100] K. Kandasamy, W. Neiswanger, J. Schneider, B. Poczos, and E. P. Xing. 2018. Neural architecture search with bayesian
optimisation and optimal transport. In Advances in Neural Information Processing Systems. MIT Press, 2016–2025.
[101] A. Kendall, Y. Gal, and R. Cipolla. 2018. Multi-task learning using uncertainty to weigh losses for scene geometry and
semantics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’18). 7482–7491.
[102] Y. H. Kim, B. Reddy, S. Yun, and C. Seo. 2017. Nemo: Neuro-evolution with multiobjective optimization of deep
neural network for speed and accuracy. In 34th International Conference on Machine Learning ICML. 1–8.
[103] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. 2017. Fast bayesian optimization of machine learning hy-
perparameters on large datasets. In Proceedings of the International Conference on Artificial Intelligence and Statistics.
528–536.
[104] A. Krizhevsky. 2014. One weird trick for parallelizing convolutional neural networks. CoRR abs/1404.5997.
[105] A. Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks.
In Advances in Neural Information Processing Systems. MIT Press, 1106–1114.
[106] Pedro Larranaga. 2002. A review on estimation of distribution algorithms. In Estimation of Distribution Algorithms.
Springer, 57–100.
[107] L. Li and T. Ameet. 2019. Random search and reproducibility for neural architecture search. Retrieved from https://
Arxiv:1902.07638.
[108] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. 2017. Hyperband: A novel bandit-based approach
to hyperparameter optimization. J. Mach. Learn. Res. 18, 1 (2017), 6765–6816.
[109] L. Li and A. Talwalkar. 2019. Random search and reproducibility for neural architecture search. In Proceedings of the
Conference on Uncertainty in Artificial Intelligence (UAI’19). 129.
[110] Z. Li, T. Xi, J. Deng, G. Zhang, S. Wen, and R. He. 2020. GP-NAS: Gaussian process based neural architecture search.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 11933–11942.
[111] X. Lin, H.-L. Zhen, Z. Li, Q.-F. Zhang, and S. Kwong. 2019. Pareto multi-task learning. In Proceedings of the Conference
on Advances in Neural Information Processing Systems (NeurIPS’19). 12037–12047.
[112] C. Liu, Z. Barret, N. Maxim, S. Jonathon, H. Wei, L. Li-Jia, F.-F. Li, Y. Alan, H. Jonathan, and M. Kevin. 2018. Progressive
neural architecture search. In Proceedings of the European Conference on Computer Vision (ECCV’18). 19–34, 2018.
[113] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and F-F. Li. 2019. Auto-deeplab: Hierarchical neural
architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition (CVPR’19). 82–92.
[114] C. Liu, L.-C. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille, and L. Fei-Fei. 2019. Auto-deeplab: Hierarchical neural
architecture search for semantic image segmentation. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition. 82–92.
[115] C. Liu, Z. Zhang, and D. Wang. [n.d.]. Pruning deep neural networks by optimal brain damage. In Proceedings of the
15th Conference International Speech Communication (INTERSPEECH). 1092–1095.
[116] H. Liu, K. Simonyan, O. Vinyals, C. Fernando, and K. Kavukcuoglu. 2017. Hierarchical representations for efficient
architecture search. Retrieved from https://Arxiv:1711.00436.
[117] H. Liu, K. Simonyan, and Y. Yang. 2018. Darts: Differentiable architecture search. Retrieved from https://Arxiv:
1806.09055.
[118] J. Liu, M. Gong, Q. Miao, X. Wang, and H. Li. 2018. Structure learning for deep neural networks based on multiob-
jective optimization. IEEE Trans. Neural Netw. Learn. Syst. 29, 6 (2018), 2450–2463.
[119] S. Liu, E. Johns, and A. J. Davison. 2019. End-to-end multi-task learning with attention. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition (CVPR’19). 1871–1880.
[120] A. Lokhmotov, N. Chunosov, F. Vella, and G. Fursin. 2018. Multi-objective autotuning of mobileNets across the full
software/hardware stack. In Proceedings of the International Conference on Architectural Support for Programming
Languages and Operating Systems (ReQuEST@ASPLOS’18). 6–16.
[121] M. Loni, S. Sinaei, A. Zoljodi, M. Daneshtalab, and M. Sjödin. 2020. DeepMaker: A multi-objective optimization
framework for deep neural networks in embedded systems. Microprocess. Microsyst. 73 (2020), 102989.
34:34 E.-G. Talbi
[122] P. R. Lorenzo and J. Nalepa. 2018. Memetic evolution of deep neural networks. In Proceedings of the Genetic and
Evolutionary Computation Conference (GECCO’18). 505–512.
[123] R. R. Lorenzo, J. Nalepa, L. S. Ramos, and J. R. Pastor. 2017. Hyperparameter selection in deep neural networks
using parallel particle swarm optimization. In Proceedings of the Genetic and Evolutionary Computation Conference.
1864–1871.
[124] I. Loshchilov and F. Hutter. 2016. CMA-ES for hyperparameter optimization of deep neural networks. CoRR
abs/1604.07269.
[125] Z. Lu, I. Whalen, V. Boddeti, Y. D. Dhebar, K. Deb, E. D. Goodman, and W. Banzhaf. 2018. NSGA-NET: A multi-
objective genetic algorithm for neural architecture search. CoRR abs/1810.03522.
[126] G. Luo. 2016. A review of automatic selection methods for machine learning algorithms and hyper-parameter values.
Netw. Model. Anal. Health Inform. Bioinform. 5, 1 (2016), 18.
[127] R. Luo, T. Fei, Q. Tao Qin, C. Enhong, and L. Tie-Yan. 2018. Neural architecture optimization. In Advances in Neural
Information Processing Systems. MIT Press, 7816–7827.
[128] L. Ma, J. Cui, and B. Yang. 2019. Deep neural architecture search with deep graph bayesian optimization. In Proceed-
ings of the International Conference on Web Intelligence (WI’19). 500–507.
[129] G. Marquet, B. Derbel, A. Liefooghe, and E-G. Talbi. 2014. Shake them all! - rethinking selection and replacement in
MOEA/D. In Proceedings of the International Conference on Parallel Problem Solving from Nature (PPSN’14). Springer,
641–651.
[130] D. Martinez, W. Brewer, G. Behm, A. Strelzoff, A. Wilson, and D. Wade. 2018. Deep learning evolutionary optimiza-
tion for regression of rotorcraft vibrational spectra. In Proceedings of the IEEE/ACM Conference on Machine Learning
in HPC Environments. 57–66.
[131] S. Masanori, S. Shinichi, and N. Tomoharu. 2018. A genetic programming approach to designing convolutional neural
network architectures. In Proceedings of the International Joint Conferences on Artificial Intelligence (IJCAI’18).
[132] N. M. Masood and G. M. Khan. 2018. Signal reconstruction using evolvable recurrent neural networks. In Proceedings
of the International Conference on Intelligent Data Engineering and Automated Learning. 594–602.
[133] H. Mendoza, A. Klein, M. Feurer, J. T. Springenberg, and F. Hutter. 2016. Towards automatically tuned neural net-
works. In Proceedings of the Workshop on Automatic Machine Learning. 58–65.
[134] K. Miettinen. 1999. Nonlinear Multiobjective Optimization. Springer.
[135] R. Miikkulainen, J. Z. Liang, E. Meyerson, A. Rawal, D. Fink, O. Francon, B. Raju, H. Shahrzad, A. Navruzyan,
N. Duffy, and B. Hodjat. 2017. Evolving deep neural networks. CoRR abs/1703.00548.
[136] S. Mittal. 2018. A survey of FPGA-based accelerators for convolutional neural networks. Neural Comput. Appl. 32, 4
(2018), 1–31.
[137] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz. 2016. Pruning convolutional neural networks for resource
efficient transfer learning. CoRR abs/1611.06440.
[138] R. Negrinho and G. Gordon. 2017. Deeparchitect: Automatically designing and training deep architectures. Retrieved
from https://Arxiv:1704.08792.
[139] K. Ni, R. Pearce, K. Boakye, B. Van Essen, D. Borth, B. Chen, and E. Wang. 2015. Large-scale deep learning on the
YFCC100M dataset. Retrieved from https://Arxiv:1502.03409.
[140] A. E. Olsson. 2010. Particle Swarm Optimization: Theory, Techniques and Applications. Nova Science Publishers.
[141] T. Paine, H. Jin, J. Yang, Z. Lin, and T. S. Huang. 2014. GPU asynchronous stochastic gradient descent to speed up
neural network training. In Proceedings of the 2nd International Conference on Learning Representations (ICLR’14).
[142] R. Parekh, J. Yang, and V. G. Honavar. 2000. Constructive neural-network learning algorithms for pattern classifica-
tion. IEEE Trans. Neural Netw. Learn. Syst. 11, 2 (2000), 436–451.
[143] E. Park, D. Kim, S. Kim, Y-D Kim, G. Kim, S. Yoon, and S. Yoo. 2015. Big/little deep neural network for ultra low
power inference. In Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis
(CODES+ISSS’15), G. Nicolescu and A. Gerstlauer (Eds.). 124–132.
[144] J. Pelamatti, L. Brevault, M. Balesdent, E-G. Talbi, and Y. Guerin. 2017. How to deal with mixed-variable optimiza-
tion problems: An overview of algorithms and formulations. In Proceedings of the World Congress of Structural and
Multidisciplinary Optimization (WCSMO’17). 64–82.
[145] J. Pelamatti, L. Brevault, M. Balesdent, E-G. Talbi, and Y. Guerin. 2021. Bayesian optimization of variable-size design
space problems. Optimiz. Eng. J. 22 (2021), 387–447.
[146] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. [n.d.]. Efficient neural architecture search via parameter sharing.
In Proceedings of the International Conference on Machine Learning (ICML), Vol. 80. 4092–4101.
[147] H. Qi, E. R. Sparks, and A. Talwalkar. 2017. Paleo: A performance model for deep neural networks. In Proceedings of
the 5th International Conference on Learning Representations (ICLR’17).
[148] A. Rawal and R. Miikkulainen. 2018. From nodes to networks: Evolving recurrent neural networks. Retrieved from
https://Arxiv:1803.04439.
[149] W. Rawat and Z. Wang. 2019. Hybrid stochastic GA-Bayesian search for deep convolutional neural network model
selection. J. Univ. Comput. Sci. 25, 6 (2019), 647–666.
[150] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. 2019. Aging evolution for image classifier architecture search. In
Proceedings of the AAAI Conference on Artificial Intelligence.
[151] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. 2019. Regularized evolution for image classifier architecture search. In
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4780–4789.
[152] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le, and A. Kurakin. 2017. Large-scale evolution of
image classifiers. In Proceedings of the International Conference on Machine Learning. 2902–2911.
[153] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. 2015. FitNets: Hints for thin deep nets. In
Proceedings of the International Conference on Learning Representations (ICLR’15).
[154] B. B. Rouhani, A. Mirhoseini, and F. Koushanfar. [n.d.]. DeLight: Adding energy dimension to deep neural networks.
In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED’16). 112–117.
[155] Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. Retrieved from https://Arxiv:
1706.05098 (2017).
[156] S. Saxena and J. Verbeek. 2016. Convolutional neural fabrics. In Advances in Neural Information Processing Systems.
MIT Press, 4053–4061.
[157] C. Sciuto, K. Yu, M. Jaggi, C. Musat, and M. Salzmann. 2019. Evaluating the search phase of neural architecture
search. Retrieved from https://Arxiv:1902.08142.
[158] C. Sciuto, K. Yu, M. Jaggi, C. Musat, and M. Salzmann. 2019. Evaluating the search phase of neural architecture
search. CoRR abs/1902.08142.
[159] A. Sener and V. Koltun. 2018. Multi-task learning as multi-objective optimization. In Advances in Neural Information
Processing Systems. MIT Press, 527–538.
[160] A. Sergeev and M. Del Balso. 2018. Horovod: Fast and easy distributed deep learning in tensorflow. Retrieved from
[161] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas. 2015. Taking the human out of the loop: A review
of bayesian optimization. Proc. IEEE 104, 1 (2015), 148–175.
[162] M. Zhang, H. Li, and S. Pan. 2020. Differentiable neural architecture search in equivalent space with exploration
enhancement. Advances in Neural Information Processing Systems, vol. 33.
[163] R. Shin, C. Packer, and D. Song. 2018. Differentiable neural network architecture search. In Proceedings of the 6th
International Conference on Learning Representations (ICLR’18).
[164] T. Shinozaki and S. Watanabe. 2015. Structure discovery of deep neural network based on evolutionary algorithms.
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’15). 4979–4983.
[165] K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Pro-
ceedings of the International Conference on Learning Representations (ICLR’15).
[166] C. Smith and Y. Jin. 2014. Evolutionary multi-objective generation of recurrent neural network ensembles for time
series prediction. Neurocomputing 143 (2014), 302–311.
[167] S. C. Smithson, G. Yang, W. J. Gross, and B. H. Meyer. 2016. Neural networks designing neural networks: Multi-
objective hyper-parameter optimization. In Proceedings of the International Conference on Computer-Aided Design
(ICCAD’16). 104.
[168] J. Snoek, H. Larochelle, and R. P. Adams. 2012. Practical bayesian optimization of machine learning algorithms. In
Advances in Neural Information Processing Systems. MIT Press, 2951–2959.
[169] J. Snoek, O. Rippel, K. Swersky, R. Kiros, N. Satish, N. Sundaram, M. A. P. Prabhat, and R. P. Adams. 2015. Scal-
able bayesian optimization using deep neural networks. In Proceedings of the International Conference on Machine
Learning. 2171–2180.
[170] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. 2016. Bayesian optimization with robust bayesian neural
networks. In Advances in Neural Information Processing Systems. MIT Press, 4134–4142.
[171] P. F. Stadler. 1996. Landscapes and their correlation functions. J. Math. Chem. 20, 1 (1996), 1–45.
[172] K. O. Stanley, J. Clune, J. Lehman, and R. Miikkulainen. 2019. Designing neural networks through neuroevolution.
Nature Mach. Intell. 1, 1 (2019), 24–35.
[173] K. O. Stanley and R. Miikkulainen. 2002. Evolving neural networks through augmenting topologies. Evolution. Com-
put. 10, 2 (2002), 99–127.
[174] I. Strumberger, E. Tuba, N. Bacanin, R. Jovanovic, and M. Tuba. 2019. Convolutional neural network architecture de-
sign by the tree growth algorithm framework. In Proceedings of the International Joint Conference on Neural Networks
(IJCNN’19). 1–8.
[175] M. Suganuma, M. Ozay, and T. Okatani. 2018. Exploiting the potential of standard convolutional autoencoders for im-
age restoration by evolutionary search. In Proceedings of the International Conference on Machine Learning (ICML’18).
4778–4787.
34:36 E.-G. Talbi
[176] Y. Sun, X. Wang, and X. Tang. 2016. Sparsifying neural network connections for face recognition. In Proceedings of
the Conference on Computer Vision and Pattern Recognition (CVPR’16). 4856–4864.
[177] Y. Sun, B. Xue, M. Zhang, and G. Yen. 2018. An experimental study on hyper-parameter optimization for stacked
auto-encoders. In Proceedings of the IEEE Congress on Evolutionary Computation (CEC’18). 1–8.
[178] Y. Sun, B. Xue, M. Zhang, and G. Yen. 2018. A particle swarm optimization-based flexible convolutional autoencoder
for image classification. IEEE Trans. Neural Netw. Learn. Syst. 30, 8 (2018), 2295–2309.
[179] K. Swersky, D. Duvenaud, J. Snoek, F. Hutter, and M. A. Osborne. 2014. Raiders of the lost architecture: Kernels for
Bayesian optimization in conditional parameter spaces. Retrieved from https://Arxiv:1409.4011.
[180] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. 2015.
Going deeper with convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition. 1–9.
[181] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. 2016. Rethinking the inception architecture for computer
vision. In Proceedings of the Conference on Computer Vision and Pattern Recognition. 2818–2826.
[182] E.-G. Talbi. 2009. Metaheuristics: from Design to Implementation. Wiley.
[183] E.-G. Talbi. 2019. A unified view of parallel multi-objective evolutionary algorithms. J. Parallel Distrib. Comput. 133
(2019), 349–358.
[184] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. 2019. Mnasnet: Platform-aware neural
architecture search for mobile. Computer Vision and Pattern Recognition Conference. 2820–2828.
[185] M. Tan and Q. V. Le. 2019. Efficientnet: Rethinking model scaling for convolutional neural networks. Retrieved from
[186] C. Thornton, F. Hutter, H. Hoos, and K. Leyton-Brown. 2013. Auto-WEKA: Combined selection and hyperparameter
optimization of classification algorithms. In Proceedings of the International Conference on Knowledge Discovery and
Data Mining. 847–855.
[187] A. Turner. 2015. Evolving Artificial Neural Networks Using Cartesian Genetic Programming. Ph.D. Dissertation. Uni-
versity of York.
[188] T. Veniat and L. Denoyer. 2018. Learning time/memory-efficient deep architectures with budgeted supernetworks.
In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR’18). 3492–3500.
[189] B. Wang, Y. Sun, B. Xue, and M. Zhang. 2018. Evolving deep convolutional neural networks by variable-length
particle swarm optimization for image classification. In Proceedings of the Congress on Evolutionary Computation
(CEC’18). 1–8.
[190] B. Wang, Y. Sun, B. Xue, and M. Zhang. 2018. A hybrid differential evolution approach to designing deep convo-
lutional neural networks for image classification. In Proceedings of the Australasian Joint Conference on Artificial
Intelligence. 237–250.
[191] B. Wang, Y. Sun, B. Xue, and M. Zhang. 2019. Evolving deep neural networks by multi-objective particle swarm
optimization for image classification. In Proceedings of the Genetic and Evolutionary Computation Conference. 490–
498.
[192] C. Wang, L. Gong, Q. Yu, X. Li, Y. Xie, and X. Zhou. 2016. DLAU: A scalable deep learning accelerator unit on FPGA.
IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 36, 3 (2016), 513–517.
[193] C. Wang, C. Xu, X. Yao, and D. Tao. 2019. Evolutionary generative adversarial networks. IEEE Trans. Evolution.
Comput. 23, 6 (2019), 921–934.
[194] L. Wang, Y. Zhao, Y. Jinnai, Y. Tian, and R. Fonseca. 2019. AlphaX: Exploring neural architectures with deep neural
networks and monte carlo tree search. CoRR abs/1903.11059.
[195] T. Wei, C. Wang, Y. Rui, and C. W. Chen. 2016. Network morphism. In Proceedings of the International Conference on
Machine Learning. 564–572.
[196] K. Weiss, T. Khoshgoftaar, and D. Wang. 2016. A survey of transfer learning. J. Big data 3, 1 (2016), 9.
[197] Y. Weng, T. Zhou, Y. Li, and X. Qiu. 2019. NAS-Unet: Neural architecture search for medical image segmentation.
IEEE Access 7 (2019), 44247–44257.
[198] C. White, W. Neiswanger, and Y. Savani. 2019. BANANAS: Bayesian optimization with neural architectures for
neural architecture search. Retrieved from https://Arxiv:1910.11858.
[199] J. Wilson, F. Hutter, and M. Deisenroth. 2018. Maximizing acquisition functions for bayesian optimization. In Ad-
vances in Neural Information Processing Systems. MIT Press, 9884–9895.
[200] M. Wistuba, A. Rawat, and T. Pedapati. 2019. A survey on neural architecture search. CoRR abs/1905.01392.
[201] J. Wu, S. Toscano-Palmerin, P. Frazier, and A. G. Wilson. 2020. Practical multi-fidelity bayesian optimization for
hyperparameter tuning. In Uncertainty in Artificial Intelligence. North-Holland, 788–798.
[202] L. Xie and A. Yuille. 2017. Genetic CNN. In Proceedings of the IEEE International Conference on Computer Vision.
1379–1388.
[203] S. Xie, H. Zheng, C. Liu, and L. Lin. 2019. SNAS: Stochastic neural architecture search. In Proceedings of the Interna-
tional Conference on Learning Representations (ICLR’19).
[204] B. Xin, L. Chen, J. Chen, H. Ishibuchi, K. Hirota, and B. Liu. 2018. Interactive multiobjective optimization: A review
of the state-of-the-art. IEEE Access 6 (2018), 41256–41279.
[205] O. Yadan, K. Adams, Y. Taigman, and M. Ranzato. 2014. Multi-GPU training of convnets. In Proceedings of the Inter-
national Conference on Learning Representations (ICLR’14).
[206] T.-J. Yang, Y.-H. Chen, and V. Sze. 2017. Designing energy-efficient convolutional neural networks using energy-
aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5687–5695.
[207] C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter. 2019. Nas-bench-101: Towards reproducible
neural architecture search. Retrieved from https://Arxiv:1902.09635.
[208] S. R. Young, D. C. Rose, T. P. Karnowski, S.-H. Lim, and R. M. Patton. 2015. Optimizing deep learning hyper-
parameters through an evolutionary algorithm. In Proceedings of the Workshop on Machine Learning in HPC En-
vironments. 41–45.
[209] A. Zela, A. Klein, S. Falkner, and F. Hutter. 2018. Towards automated deep learning: Efficient joint neural architecture
and hyperparameter search. Retrieved from https://Arxiv:1807.06906.
[210] Q. Zhang and H. Li. 2007. MOEA/D: A multiobjective evolutionary algorithm based on decomposition. IEEE Trans.
Evolution. Comput. 11, 6 (2007), 712–731.
[211] X. Zhang, Z. Huang, and N. Wang. 2018. You only search once: Single shot neural architecture search via direct
sparse optimization. CoRR abs/1811.01567.
[212] Y. Zhang, C. Wang, L. Gong, Y. Lu, F. Sun, C. Xu, X. Li, and X. Zhou. 2017. A power-efficient accelerator based on
FPGAs for LSTM network. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’17).
629–630.
[213] G. Zhong, T. Li, W. Jiao, L.-N. Wang, J. Dong, and C.-L. Liu. 2020. DNA computing inspired deep networks design.
Neurocomputing 382 (2020), 140–147.
[214] Z. Zhong, J. Yan, and C.-L. Liu. 2017. Practical network blocks design with q-learning. CoRR abs/1708.05552.
[215] Z. Zhong, J. Yan, W. Wu, J. Shao, and C.-H. Liu. 2018. Practical block-wise neural network architecture generation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2423–2432.
[216] Y. Zhou, S. Arik, H. Yu, H. Liu, and G. Diamos. 2018. Resource-efficient neural architect. Retrieved from https://Arxiv:
1806.07912.
[217] B. Zhuang, C. Shen, M. Tan, L. Liu, and I. D. Reid. 2019. Structured binary neural networks for accurate image
classification and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 413–422.
[218] B. Zoph and Q. V. Le. 2017. Neural architecture search with reinforcement learning. In Proceedings of the International
Conference on Learning Representations (ICLR’17).
[219] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. 2018. Learning transferable architectures for scalable image recogni-
tion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8697–8710.
Received June 2020; revised November 2020; accepted November 2020

Talbi 2021

Uploaded by

Document Informationclick to expand document information

Document Informationclick to expand document information

Copyright:

Available Formats

Talbi 2021

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Talbi 2021

Uploaded by

Copyright:

Available Formats

Automated Design of Deep Neural Networks: A Survey and

EL-GHAZALI TALBI, University of Lille, CNRS, INRIA 34

2.1 Deep Neural Networks

2.2.2 Population-based Metaheuristics. Population-based metaheuristics (P-metaheuristics)

1 Also named transition rule, pivoting rule, and selection strategy.

4.1 Representation of DNNs

Fig. 2. Classification of the different encodings of DNNs.

Fig. 4. Skip connected and multiple edges chain-structured DNNs.

3 Also called patterns, modules, stages, segments, motifs, and cells.

4.2 Performance Estimation of DNNs

4 Also called Hypernetworks.

5 Also known as meta-model, reduced model, and approximation.

4.3 Constraint Handling

4.4 Initial Solution(s)

4.5 Variation Operators

Fig. 9. Unary variation operators at different levels of a DNN.

• Architecture: n-ary operators have been used in different optimization frameworks:

7 Also known as Bayesian Optimization.

Fig. 11. The general framework of surrogate-based optimization algorithms.

6.1 Multi-objective Single-task Learning

8 Also named Pareto approaches.

6.2 Multi-objective Multi-task Learning

Fig. 12. Parallel models of metaheuristics for AutoDNN.

7.1 Parallel Design

7.2 Parallel Implementation

Distributed-memory architectures: Clusters of CPU nodes represent the most popular

8 CONCLUSIONS AND PERSPECTIVES

Received June 2020; revised November 2020; accepted November 2020

You might also like