Most Influential Data Science Research Papers
These papers provide a breadth of information about data science that is generally useful and interesting from an AI
perspective.
Contents
A General and Adaptive Robust Loss Function
Group Normalization
Backprop Evolution
Recent Advances in Object Detection in the Age of Deep Convolutional Neural Networks
A General and Adaptive Robust Loss Function
Jonathan T. Barron
Google Research
In Sections 3.1 and 3.2 we take vision-oriented deep learning models (variational autoencoders for image synthesis and self-supervised monocular depth estimation), replace their losses with the negative log-likelihood of our general distribution, and demonstrate that allowing our distribution to automatically determine its own robustness can improve performance without introducing any additional manually-tuned hyperparameters. In Sections 3.3 and 3.4 we use our loss function to generalize algorithms for the classic vision tasks of registration and clustering, and demonstrate the performance improvement that can be achieved by introducing robustness as a hyperparameter that is annealed or manually tuned.

1. Loss Function

The simplest form of our loss function is:
\[ f(x, \alpha, c) = \frac{|\alpha-2|}{\alpha}\left(\left(\frac{(x/c)^2}{|\alpha-2|}+1\right)^{\alpha/2}-1\right) \quad (1) \]
Here α ∈ R is a shape parameter that controls the robustness of the loss and c > 0 is a scale parameter that controls the size of the loss's quadratic bowl near x = 0.

Though our loss is undefined when α = 2, it approaches L2 loss (squared error) in the limit:
\[ \lim_{\alpha \to 2} f(x, \alpha, c) = \tfrac{1}{2}(x/c)^2 \quad (2) \]
When α = 1 our loss is a smoothed form of L1 loss:
\[ f(x, 1, c) = \sqrt{(x/c)^2+1}-1 \quad (3) \]
This is often referred to as Charbonnier loss [6], pseudo-Huber loss (as it resembles Huber loss [19]), or L1-L2 loss [40] (as it behaves like L2 loss near the origin and like L1 loss elsewhere).

Our loss's ability to express L2 and smoothed L1 losses is shared by the "generalized Charbonnier" loss [35], which has been used in flow and depth estimation tasks that require robustness [7, 24] and is commonly defined as:
\[ \left(x^2+c^2\right)^{\alpha/2} \quad (4) \]
Our loss has significantly more expressive power than the generalized Charbonnier loss, which we can see by setting our shape parameter α to nonpositive values. Though f(x, 0, c) is undefined, we can take the limit of f(x, α, c) as α approaches zero:
\[ \lim_{\alpha \to 0} f(x, \alpha, c) = \log\left(\tfrac{1}{2}(x/c)^2+1\right) \quad (5) \]
This yields Cauchy (aka Lorentzian) loss [2]. By setting α = −2, our loss reproduces Geman-McClure loss [14]:
\[ f(x, -2, c) = \frac{2(x/c)^2}{(x/c)^2+4} \quad (6) \]
In the limit as α approaches negative infinity, our loss becomes Welsch [21] (aka Leclerc [26]) loss:
\[ \lim_{\alpha \to -\infty} f(x, \alpha, c) = 1-\exp\left(-\tfrac{1}{2}(x/c)^2\right) \quad (7) \]

With this analysis we can present our final loss function, which is simply f(·) with special cases for its removable singularities at α = 0 and α = 2 and its limit at α = −∞:
\[ \rho(x, \alpha, c) = \begin{cases} \tfrac{1}{2}(x/c)^2 & \text{if } \alpha = 2 \\ \log\left(\tfrac{1}{2}(x/c)^2+1\right) & \text{if } \alpha = 0 \\ 1-\exp\left(-\tfrac{1}{2}(x/c)^2\right) & \text{if } \alpha = -\infty \\ \frac{|\alpha-2|}{\alpha}\left(\left(\frac{(x/c)^2}{|\alpha-2|}+1\right)^{\alpha/2}-1\right) & \text{otherwise} \end{cases} \quad (8) \]
As we have shown, this loss function is a superset of the Welsch/Leclerc, Geman-McClure, Cauchy/Lorentzian, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions.

To enable gradient-based optimization we can derive the derivative of ρ(x, α, c) with respect to x:
\[ \frac{\partial \rho}{\partial x}(x, \alpha, c) = \begin{cases} \frac{x}{c^2} & \text{if } \alpha = 2 \\ \frac{2x}{x^2+2c^2} & \text{if } \alpha = 0 \\ \frac{x}{c^2}\exp\left(-\tfrac{1}{2}(x/c)^2\right) & \text{if } \alpha = -\infty \\ \frac{x}{c^2}\left(\frac{(x/c)^2}{|\alpha-2|}+1\right)^{\alpha/2-1} & \text{otherwise} \end{cases} \quad (9) \]
Our loss and its derivative are visualized for different values of α in Figure 1.

The shape of the derivative gives some intuition as to how α affects behavior when our loss is being minimized by gradient descent or some related method. For all values of α the derivative is approximately linear when |x| < c, so the effect of a small residual is always linearly proportional to that residual's magnitude. If α = 2, the derivative's magnitude stays linearly proportional to the residual's magnitude: a larger residual has a correspondingly larger effect. If α = 1 the derivative's magnitude saturates to a constant 1/c as |x| grows larger than c, so as a residual increases its effect never decreases but never exceeds a fixed amount. If α < 1 the derivative's magnitude begins to decrease as |x| grows larger than c (in the language of M-estimation [17], the derivative, aka "influence", is "redescending"), so as the residual of an outlier increases, that outlier has less effect during gradient descent. The effect of an outlier diminishes as α becomes more negative, and as α approaches −∞ an outlier whose residual magnitude is larger than 3c is almost completely ignored.

We can also reason about α in terms of averages. Because the empirical mean of a set of values minimizes total squared error between the mean and the set, and the empirical median similarly minimizes absolute error, minimizing our loss with α = 2 is equivalent to estimating a mean, and with α = 1 is similar to estimating a median. Minimizing our loss with α = −∞ is equivalent to local mode-finding [36]. Values of α between these extents can be thought of as smoothly interpolating between these three kinds of averages during estimation.

Our loss function has several useful properties that we will take advantage of. The loss is smooth (i.e., in C∞) with respect to x, α, and c > 0, and is therefore well-suited to gradient-based optimization over its input and its parameters. The loss is zero at the origin, and increases monotonically with respect to |x|:
\[ \rho(0, \alpha, c) = 0 \qquad \frac{\partial \rho}{\partial |x|}(x, \alpha, c) \ge 0 \quad (10) \]
The loss is invariant to a simultaneous scaling of c and x:
\[ \forall_{k>0} \;\; \rho(kx, \alpha, kc) = \rho(x, \alpha, c) \quad (11) \]
The loss increases monotonically with respect to α:
\[ \frac{\partial \rho}{\partial \alpha}(x, \alpha, c) \ge 0 \quad (12) \]
This is convenient for graduated non-convexity [4]: we can initialize α such that our loss is convex and then gradually reduce α (and therefore reduce convexity and increase robustness) during optimization, thereby enabling robust estimation that (often) avoids local minima.

We can take the limit of the loss as α approaches infinity, which due to Eq. 12 must be the upper bound of the loss:
\[ \rho(x, \alpha, c) \le \lim_{\alpha \to +\infty} \rho(x, \alpha, c) = \exp\left(\tfrac{1}{2}(x/c)^2\right)-1 \quad (13) \]
We can bound the magnitude of the gradient of the loss, which allows us to better reason about exploding gradients:
\[ \left|\frac{\partial \rho}{\partial x}(x, \alpha, c)\right| \le \begin{cases} \frac{1}{c}\left(\frac{\alpha-2}{\alpha-1}\right)^{(\alpha-1)/2} & \text{if } \alpha \le 1 \\ \frac{|x|}{c^2} & \text{if } \alpha \le 2 \end{cases} \quad (14) \]
L1 loss is not expressible by our loss, but if c is much smaller than x we can approximate it with α = 1:
\[ f(x, 1, c) \approx \frac{|x|}{c}-1 \quad \text{if } c \ll x \quad (15) \]
See Appendix E for other potentially-useful properties that are not used in our experiments.
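To make the piecewise definition in Eq. 8 concrete, here is a minimal NumPy sketch of the general loss and its named special cases. The function name general_loss and the use of plain NumPy are our own choices for illustration under the definitions above, not the paper's reference implementation.

import numpy as np

def general_loss(x, alpha, c):
    """A minimal sketch of the general robust loss rho(x, alpha, c) of Eq. 8."""
    x = np.asarray(x, dtype=np.float64)
    z = 0.5 * (x / c) ** 2          # the common quadratic term (1/2)(x/c)^2
    if alpha == 2:                  # L2 loss
        return z
    if alpha == 0:                  # Cauchy / Lorentzian loss
        return np.log1p(z)
    if alpha == -np.inf:            # Welsch / Leclerc loss
        return 1.0 - np.exp(-z)
    # General case: (|a-2|/a) * (((x/c)^2 / |a-2| + 1)^(a/2) - 1)
    b = abs(alpha - 2.0)
    return (b / alpha) * (((x / c) ** 2 / b + 1.0) ** (alpha / 2.0) - 1.0)

# Quick checks against two of the named special cases:
x = np.linspace(-4.0, 4.0, 9)
assert np.allclose(general_loss(x, 1.0, 1.0), np.sqrt(x ** 2 + 1.0) - 1.0)     # Charbonnier (Eq. 3)
assert np.allclose(general_loss(x, -2.0, 1.0), 2.0 * x ** 2 / (x ** 2 + 4.0))  # Geman-McClure (Eq. 6)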
2. Probability Density Function

With our loss function we can construct a general probability distribution, such that the negative log-likelihood (NLL) of its PDF is a shifted version of our loss function:
\[ p(x \mid \mu, \alpha, c) = \frac{1}{cZ(\alpha)}\exp\left(-\rho(x-\mu, \alpha, c)\right) \quad (16) \]
\[ Z(\alpha) = \int_{-\infty}^{\infty}\exp\left(-\rho(x, \alpha, 1)\right)dx \quad (17) \]
where p(x | µ, α, c) is only defined if α ≥ 0, as Z(α) is divergent when α < 0. For some values of α the partition function is relatively straightforward:
\[ Z(0) = \pi\sqrt{2} \qquad Z(1) = 2eK_{1}(1) \qquad Z(2) = \sqrt{2\pi} \qquad Z(4) = e^{1/4}K_{1/4}(1/4) \quad (18) \]
where K_n(·) is the modified Bessel function of the second kind. For any rational positive α (excluding a singularity at α = 2) where α = n/d with n, d ∈ N, Z(n/d) has a closed form in terms of the Meijer G-function G^{0,0}_{p,q}(a_p; b_q | ·), with parameter multisets
\[ a_p = \left\{\tfrac{i}{n} : i = 1, \ldots, n-1\right\}, \qquad b_q = \left\{\tfrac{i}{n} : i = -\tfrac{1}{2}, \ldots, n-\tfrac{3}{2}\right\}\cup\left\{\tfrac{i}{2d} : i = 1, \ldots, 2d-1\right\} \quad (19) \]
where G(·) is the Meijer G-function and b_q is a multiset (items may occur twice). Because the partition function is difficult to evaluate or differentiate, in our experiments we approximate log(Z(α)) with a cubic hermite spline (see Appendix C for details).

Just as our loss function includes several common loss functions as special cases, our distribution includes several common distributions as special cases. When α = 2 our distribution becomes a normal (Gaussian) distribution, and when α = 0 our distribution becomes a Cauchy distribution. These are also both special cases of Student's t-distribution (ν = ∞ and ν = 1, respectively), though these are the only two points where these two families of distributions intersect. Our distribution resembles the generalized Gaussian distribution [29, 34], except that it is "smoothed" so as to approach a Gaussian distribution near the origin regardless of the shape parameter α. The PDF and NLL of our distribution for different values of α can be seen in Figure 2.
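Because Z(α) in Eq. 17 is a one-dimensional integral, the closed-form values in Eq. 18 are easy to sanity-check numerically. The sketch below reuses the general_loss sketch from Section 1 and SciPy quadrature; it is only an illustrative check, not the spline approximation the paper actually uses (see Appendix C).

import numpy as np
from scipy.integrate import quad

def partition_function(alpha):
    """Numerically evaluate Z(alpha) = integral of exp(-rho(x, alpha, 1)) dx, valid for alpha >= 0."""
    integrand = lambda x: np.exp(-general_loss(x, alpha, 1.0))  # general_loss from the sketch above
    value, _ = quad(integrand, -np.inf, np.inf)
    return value

print(partition_function(0.0), np.pi * np.sqrt(2.0))   # both ~4.4429
print(partition_function(2.0), np.sqrt(2.0 * np.pi))   # both ~2.5066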
In later experiments we will use the NLL of our general distribution, −log(p(· | α, c)), as the loss for training our neural networks, not our general loss ρ(·, α, c). Critically, using the NLL allows us to treat α as a free parameter, thereby allowing optimization to automatically determine the degree of robustness that should be imposed by the loss being used during training. To understand why the NLL must be used for this, consider a training procedure in which we simply minimize ρ(·, α, c) with respect to α and our model weights. In this scenario, the monotonicity of our general loss with respect to α (Eq. 12) means that optimization can trivially minimize the cost of outliers by setting α to be as small as possible. Now consider that same training procedure in which we minimize the NLL of our distribution instead of our loss. As can be observed in Figure 2, reducing α will decrease the NLL of outliers but will increase the NLL of inliers. During training, optimization will have to choose between reducing α, thereby getting a "discount" on large errors at the cost of paying a penalty for small errors, or increasing α, thereby incurring a higher cost for outliers but a lower cost for inliers. This tradeoff forces optimization to judiciously adapt the robustness of the NLL being minimized. As we will demonstrate later, allowing the NLL to adapt in this way can increase performance on a variety of learning tasks, in addition to obviating the need for manually tuning α as a fixed hyperparameter.

Figure 2. The negative log-likelihoods (left) and probability densities (right) of the distribution corresponding to our loss function when it is defined (α ≥ 0). NLLs are simply losses (Fig. 1) shifted by a log partition function. Densities are bounded by a scaled Cauchy distribution.

Sampling from our distribution is straightforward given the observation that −log(p(x | 0, α, 1)) is bounded from below by ρ(x, 0, 1) + log(Z(α)) (shifted Cauchy loss). See Figure 2 for visualizations of this bound when α = ∞, which also bounds the NLL for all values of α. This lets us perform rejection sampling using a Cauchy as the proposal distribution. Because our distribution is a location-scale family, we sample from p(x | 0, α, 1) and then scale and shift that sample by c and µ respectively. This sampling approach is efficient, with an acceptance rate between ~45% (α = ∞) and 100% (α = 0). Pseudocode for sampling is shown in Algorithm 1.

Algorithm 1 Sampling from our general distribution
Input: Parameters for the distribution to sample {µ, α, c}
Output: A sample drawn from p(x | µ, α, c).
1: while True:
2:   x ~ Cauchy(x0 = 0, γ = √2)
3:   u ~ Uniform(0, 1)
4:   if u < p(x | 0, α, 1) / exp(−ρ(x, 0, 1) − log(Z(α))):
5:     return cx + µ
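Below is a small Python sketch of Algorithm 1. In the acceptance ratio the two factors of Z(α) cancel, so this sketch never needs the partition function explicitly; that cancellation is a property of this particular proposal, not an extra approximation. The function names are our own, and the sketch reuses general_loss from Section 1.

import numpy as np

def sample_general(mu, alpha, c, rng=np.random.default_rng()):
    """Rejection-sample from p(x | mu, alpha, c) for alpha >= 0 using a Cauchy(0, sqrt(2)) proposal."""
    while True:
        x = rng.standard_cauchy() * np.sqrt(2.0)   # proposal draw with gamma = sqrt(2)
        u = rng.uniform()
        # p(x|0,alpha,1) / exp(-rho(x,0,1) - log Z(alpha)) simplifies to exp(rho(x,0,1) - rho(x,alpha,1)):
        if u < np.exp(general_loss(x, 0.0, 1.0) - general_loss(x, alpha, 1.0)):
            return c * x + mu                      # location-scale transform

samples = np.array([sample_general(0.0, 1.0, 1.0) for _ in range(1000)])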
3. Experiments

We will now demonstrate the utility of our loss function and distribution with four experiments. None of these results are intended to represent the state-of-the-art for any particular task; our goal is to demonstrate the value of our loss and distribution as useful tools in isolation. We will show that across a variety of tasks, just replacing the loss function of an existing model with our general loss function can enable significant performance improvements.

In Sections 3.1 and 3.2 we focus on learning-based vision tasks in which training involves minimizing the difference between images: variational autoencoders for image synthesis and self-supervised monocular depth estimation. We will generalize and improve models for both tasks by using our general distribution (either as a conditional distribution in a generative model or by using its NLL as an adaptive loss) and allowing the distribution to automatically determine its own degree of robustness. Because robustness is automatic and requires no manually-tuned hyperparameters, we can even allow for the robustness of our loss to be adapted individually for each dimension of our output space; we can have a different degree of robustness at each pixel in an image, for example. As we will show, this approach is particularly effective when combined with image representations such as wavelets, in which we expect to see non-Gaussian, heavy-tailed distributions.

In Sections 3.3 and 3.4 we will build upon existing algorithms for two classic vision tasks (registration and clustering) that both work by minimizing a robust loss that is subsumed by our general loss. We will then replace each algorithm's fixed robust loss with our loss, thereby introducing a continuous tunable robustness parameter α. This generalization allows us to introduce new models in which α is manually tuned or annealed, thereby improving performance. These results demonstrate the value of our loss function when designing classic vision algorithms, by allowing model robustness to be introduced into the algorithm design space as a continuous hyperparameter.

3.1. Variational Autoencoders

Variational autoencoders [23, 31] are a landmark technique for training autoencoders as generative models, which can then be used to draw random samples that resemble training data. We will demonstrate that our general distribution can be used to improve the log-likelihood performance of VAEs for image synthesis on the CelebA dataset [27]. A common design decision for VAEs is to model images using an independent normal distribution on a vector of RGB pixel values [23], and we use this design as our baseline model. Recent work has improved upon this model by using deep, learned, and adversarial loss functions [9, 16, 25]. Though it's possible that our general loss or distribution can add value in these circumstances, to more precisely isolate our contribution we will explore the hypothesis that the baseline model of normal distributions placed on a per-pixel image representation can be improved significantly with the small change of just modeling a linear transformation of a VAE's output with our general distribution. Again, our goal is not to advance the state of the art for any particular image synthesis task, but is instead to explore the value of our distribution in an experimentally controlled setting.
In our baseline model we give each pixel's normal distribution a variable scale parameter σ(i) that will be optimized over during training, thereby allowing the VAE to adjust the scale of its distribution for each output dimension. We can straightforwardly replace this per-pixel normal distribution with a per-pixel general distribution, in which each output dimension is given a distinct shape parameter α(i) in addition to its scale parameter c(i) (i.e., σ(i)). By letting the α(i) parameters be free variables alongside the scale parameters, training is able to adaptively select both the scale and robustness of the VAE's posterior distribution over pixel values. We restrict all α(i) to be in (0, 3), which allows our distribution to generalize Cauchy (α = 0) and Normal (α = 2) distributions and anything in between, as well as [...]

[...] Neither our distribution nor Student's t-distribution consistently outperforms the other across representations. Note that the t-distribution's NLL does not generalize the Charbonnier, L1, Geman-McClure, or Welsch losses, so unlike ours it will not generalize the losses used in the other tasks we will address. For all representations, VAEs trained with our general distribution produce sharper and more detailed samples than those trained with normal distributions. Models trained with Cauchy and t-distributions preserve high-frequency detail and work well on pixel representations, but systematically fail to synthesize low-frequency image content when given non-pixel representations, as evidenced by the gray backgrounds of those samples. Comparing performance across image representations shows that the "Wavelets + YUV" representation best maximizes validation set ELBO, though if we were to limit our model to only normal distributions the "DCT + YUV" model would appear superior, suggesting that there is value in reasoning [...]

Table 1. Validation set ELBOs (higher is better) for our variational autoencoders. Models using our general distribution better maximize the likelihood of unseen data than those using normal or Cauchy distributions (both special cases of our model) for all three image representations, and perform similarly to Student's t-distribution (a different generalization of normal and Cauchy distributions). The best and second best performing techniques for each representation are colored orange and yellow respectively.

                   Normal    Cauchy    t-dist.   Ours
Pixels + RGB        8,662     9,602    10,177   10,240
DCT + YUV          31,837    31,295    32,804   32,806
Wavelets + YUV     31,505    35,779    36,373   36,316

[Figure 3: random samples from our VAEs for each output distribution (columns: Normal, Cauchy, t-distribution, Ours) and image representation (rows include Pixels + RGB and DCT + YUV).]

3.2. Unsupervised Monocular Depth Estimation

[...] with the scale c fixed to 0.01, thereby matching the fixed scale assumption of the baseline model and roughly matching the shape of its L1 loss (Eq. 15). To avoid exploding gradients we multiply the loss being minimized by c, thereby bounding gradient magnitudes by residual magnitudes (Eq. 14). For the "fixed" models we use a constant value for α for all wavelet coefficients, and observe that though performance is improved relative to the baseline, no single value of α is optimal. The α = 1 entry is simply a smoothed version of the L1 loss used by the baseline model, suggesting that just using a wavelet representation improves performance. In the "annealing α = 2 → 0" model we linearly interpolate α from 2 (L2) to 0 (Cauchy) as a function of training iteration, which outperforms all "fixed" models. In the "adaptive α ∈ (0, 2)" model we assign each wavelet coefficient its own shape parameter as a free variable and we allow those variables to be optimized alongside our network weights during training as was done in Section 3.1, but with αmin = 0 and αmax = 2. This "adaptive" strategy outperforms the "annealing" and all "fixed" strategies, thereby demonstrating the value of allowing the model to adaptively determine the robustness of its loss during training. Note that though the "fixed" and "annealed" strategies only require our general loss, the "adaptive" strategy requires that we use the NLL of our general distribution as our loss; otherwise training would simply drive α to be as small as possible due to the monotonicity of our loss with respect to α, causing performance to degrade to the "fixed α = 0" model. Comparing the "adaptive" model's performance to that of the "fixed" models suggests that, as in Section 3.1, no single setting of α is optimal for all wavelet coefficients. Overall, we see that just replacing the loss function of [42] with our adaptive loss on wavelet coefficients reduces average error by ~17%.

In Figure 4 we compare our "adaptive" model's output to the baseline model and the ground-truth depth, and demonstrate a substantial qualitative improvement. See Appendix H for many more results, and for visualizations of the per-coefficient robustness selected by our model.

Table 2. Results on unsupervised monocular depth estimation using the KITTI dataset [13], building upon the model from [42] ("Baseline"). By replacing the per-pixel loss used by [42] with several variants of our own per-wavelet general loss function in which our loss's shape parameters are fixed, annealed, or adaptive, we see a significant performance improvement. The top three techniques are colored red, orange, and yellow for each metric.

[Figure 4: qualitative monocular depth estimation comparisons (rows: Input, Baseline, Ours).]
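As a concrete illustration of the "adaptive" and "annealing" strategies described above, one way to keep each per-coefficient shape parameter inside a fixed range while leaving it freely optimizable is to squash a latent free variable with a sigmoid. The sketch below is our own minimal construction under that assumption; the names and the sigmoid parametrization are illustrative, not the paper's exact code.

import numpy as np

def shape_from_latent(alpha_latent, alpha_min=0.0, alpha_max=2.0):
    """Map an unconstrained latent variable to a shape parameter in (alpha_min, alpha_max)."""
    return alpha_min + (alpha_max - alpha_min) / (1.0 + np.exp(-alpha_latent))

def annealed_alpha(step, num_steps):
    """The 'annealing' baseline: linearly interpolate alpha from 2 (L2) down to 0 (Cauchy)."""
    return 2.0 * (1.0 - step / float(num_steps))

print(shape_from_latent(0.0))        # 1.0: a latent of zero sits at the midpoint of (0, 2)
print(annealed_alpha(5000, 10000))   # 1.0: halfway through training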
3.3. Fast Global Registration

Robustness is often a core component of geometric registration [38]. The Fast Global Registration (FGR) algorithm of [41] finds the rigid transformation T that aligns point sets {p} and {q} by minimizing the following loss:
\[ \sum_{(p,q)} \rho_{gm}\left(\|p - Tq\|, c\right) \quad (23) \]
where ρgm(·) is Geman-McClure loss. By using the Black and Rangarajan duality between robust estimation and line processes [3], FGR is capable of producing high-quality registrations at high speeds. Because Geman-McClure loss is a special case of our loss, and because we can formulate our loss as an outlier process (see Appendix A), we can generalize FGR to an arbitrary shape parameter α by replacing ρgm(·, c) with our ρ(·, α, c) (where setting α = −2 reproduces FGR).

This generalized FGR (gFGR) enables algorithmic improvements. FGR iteratively solves a linear system while annealing its scale parameter c, which has the effect of gradually introducing nonconvexity. gFGR enables an alternative strategy in which we directly manipulate convexity by annealing α instead of c. This "shape-annealed gFGR" follows the same procedure as [41]: 64 iterations in which a parameter is annealed every 4 iterations. Instead of annealing c, we set it to its terminal value and instead anneal α over the following values:
2, 1, 1/2, 1/4, 0, −1/4, −1/2, −1, −2, −4, −8, −16, −32
Table 3 shows results for the 3D point cloud registration task of [41] (Table 1 in that paper), which shows that annealing shape produces moderately improved performance over FGR for high-noise inputs, and behaves equivalently on low-noise inputs. This suggests that performing graduated non-convexity by directly adjusting a shape parameter that controls non-convexity (a procedure that is enabled by our general loss) is preferable to indirectly controlling non-convexity by annealing a scale parameter.

Another generalization is to continue using the c-annealing strategy of [41], but treat α as a hyperparameter and tune it independently for each noise level in this task. In Figure 5 we set α to a wide range of values and report errors for each setting, using the same evaluation as [41]. We see that for high-noise inputs more negative values of α are preferable, but for low-noise inputs values closer to 0 are optimal. We report the lowest-error entry for each noise level as "gFGR*" in Table 3, where we see a significant reduction in error, thereby demonstrating the improvement that can be achieved by treating robustness as a hyperparameter.

Table 3. Results on the registration task of [41], in which we compare their "FGR" algorithm to two versions of our "gFGR" generalization.

Figure 5. Performance (lower is better) of our gFGR algorithm on the task of [41] as we vary our shape parameter α, with the lowest-error point indicated by a circle. FGR (equivalent to gFGR with α = −2) is shown as a dashed line and a square, and shape-annealed gFGR for each noise level is shown as a dotted line.
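To make the shape-annealing idea tangible outside of FGR, here is a self-contained toy example, entirely our own rather than anything from the paper, that estimates a robust location parameter by gradient descent while annealing α over roughly the schedule listed above (α values near the singularities at 2 and 0 are nudged slightly so the single closed-form derivative of Eq. 9 applies). The outliers at 10 are progressively ignored as α decreases.

import numpy as np

def dgeneral_loss_dx(x, alpha, c):
    """Derivative of the general loss with respect to x (Eq. 9), for alpha away from {0, 2, -inf}."""
    return (x / c ** 2) * ((x / c) ** 2 / abs(alpha - 2.0) + 1.0) ** (alpha / 2.0 - 1.0)

data = np.concatenate([np.random.normal(0.0, 0.2, 90), np.full(10, 10.0)])  # inliers plus outliers
schedule = [2.0 - 1e-4, 1.0, 0.5, 0.25, 1e-4, -0.25, -0.5, -1.0, -2.0, -4.0, -8.0, -16.0, -32.0]
mu, c, lr = np.mean(data), 0.5, 0.1
for alpha in schedule:          # graduated non-convexity: start near-convex, end highly robust
    for _ in range(100):        # a few gradient steps per annealing stage
        mu -= lr * np.mean(dgeneral_loss_dx(mu - data, alpha, c))
print(mu)                       # close to 0 (the inlier mode) rather than the contaminated mean 1.0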
Table 4. Results on the clustering task of [32] where we compare their "RCC" algorithm to our "gRCC*" generalization in terms of AMI on several datasets. We also report the AMI increase of "gRCC*" with respect to "RCC". Baselines are taken from [32].

Dataset        AC-W    N-Cuts [33]  LDMGI [37]  PIC [39]  RCC-DR [32]  RCC [32]  gRCC*   Rel. Impr.
YaleB          0.767   0.928        0.945       0.941     0.974        0.975     0.975   0.4%
COIL-100       0.853   0.871        0.888       0.965     0.957        0.957     0.962   11.6%
MNIST          0.679   -            0.761       -         0.828        0.893     0.901   7.9%
YTF            0.801   0.752        0.518       0.676     0.874        0.836     0.888   31.9%
Pendigits      0.728   0.813        0.775       0.467     0.854        0.848     0.871   15.1%
Mice Protein   0.525   0.536        0.527       0.394     0.638        0.649     0.650   0.2%
Reuters        0.471   0.545        0.523       0.057     0.553        0.556     0.561   1.1%
Shuttle        0.291   0.000        0.591       -         0.513        0.488     0.493   0.9%
RCV1           0.364   0.140        0.382       0.015     0.442        0.138     0.338   23.2%
3.4. Robust Continuous Clustering

In [32] robust losses are used for unsupervised clustering, by minimizing:
\[ \sum_{i}\|x_i - u_i\|_2^2 + \lambda\sum_{(p,q)\in E} w_{p,q}\,\rho_{gm}\left(\|u_p - u_q\|_2\right) \quad (24) \]
where {x_i} is a set of input datapoints, {u_i} is a set of "representatives" (cluster centers), and E is a mutual k-nearest neighbors (m-kNN) graph. As in Section 3.3, ρgm(·) is Geman-McClure loss, which means that our loss can be used to generalize this algorithm. Using the RCC code provided by the authors (and keeping all hyperparameters fixed to their default values) we replace Geman-McClure loss with our general loss and then sweep over values of α. In Figure 6 we show the adjusted mutual information (AMI, the metric used by [32]) of the resulting clustering for each value of α on the datasets used in [32], and in Table 4 we report the AMI for the best-performing value of α for each dataset as "gRCC*". On some datasets performance is insensitive to α, but on others adjusting α improves performance by as much as 32%. This improvement demonstrates the gains that can be achieved by introducing robustness as a hyperparameter and tuning it accordingly.

Figure 6. Performance (higher is better) of our gRCC algorithm on the clustering task of [32], for different values of our shape parameter α, with the highest-accuracy point indicated by a dot. Because the baseline RCC algorithm is equivalent to gRCC with α = −2, we highlight that α value with a dashed line and a square.

4. Conclusion

We have presented a two-parameter loss function that generalizes many existing one-parameter robust loss functions: the Cauchy/Lorentzian, Geman-McClure, Welsch/Leclerc, generalized Charbonnier, Charbonnier/pseudo-Huber/L1-L2, and L2 loss functions. By reducing a family of discrete single-parameter losses to a single function with two continuous parameters, our loss enables the convenient exploration and comparison of different robust penalties. This allows us to generalize and improve algorithms designed around the minimization of some fixed robust loss function, which we have demonstrated for registration and clustering. When used as a negative log-likelihood, this loss gives a general probability distribution that includes normal and Cauchy distributions as special cases. This distribution lets us train neural networks in which the loss has an adaptive degree of robustness for each output dimension, which allows training to automatically determine how much robustness should be imposed by the loss without any manual parameter tuning. When this adaptive loss is paired with image representations in which variable degrees of heavy-tailed behavior occur, such as wavelets, this adaptive training approach allows us to improve the performance of variational autoencoders for image synthesis and of neural networks for unsupervised monocular depth estimation.

Acknowledgements: Thanks to Rob Anderson, Jesse Engel, David Gallup, Ross Girshick, Jaesik Park, Ben Poole, Vivek Rathod, and Tinghui Zhou.
References

[1] Nasir Ahmed, T Natarajan, and Kamisetty R Rao. Discrete cosine transform. IEEE Transactions on Computers, 1974.
[2] Michael J Black and Paul Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. CVIU, 1996.
[3] Michael J. Black and Anand Rangarajan. On the unification of line processes, outlier rejection, and robust statistics with applications in early vision. IJCV, 1996.
[4] Andrew Blake and Andrew Zisserman. Visual Reconstruction. MIT Press, 1987.
[5] Richard H. Byrd and David A. Pyne. Convergence of the iteratively reweighted least-squares algorithm for robust regression. Technical report, Dept. of Mathematical Sciences, The Johns Hopkins University, 1979.
[6] Pierre Charbonnier, Laure Blanc-Feraud, Gilles Aubert, and Michel Barlaud. Two deterministic half-quadratic regularization algorithms for computed imaging. ICIP, 1994.
[7] Qifeng Chen and Vladlen Koltun. Fast MRF optimization with application to depth reconstruction. CVPR, 2014.
[8] Albert Cohen, Ingrid Daubechies, and J-C Feauveau. Biorthogonal bases of compactly supported wavelets. Communications on Pure and Applied Mathematics, 1992.
[9] Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. NIPS, 2016.
[10] David J. Field. Relations between the statistics of natural images and the response properties of cortical cells. JOSA A, 1987.
[11] John Flynn, Ivan Neulander, James Philbin, and Noah Snavely. DeepStereo: Learning to predict new views from the world's imagery. CVPR, 2016.
[12] Ravi Garg, BG Vijay Kumar, Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. ECCV, 2016.
[13] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. CVPR, 2012.
[14] Stuart Geman and Donald E. McClure. Bayesian image analysis: An application to single photon emission tomography. Proceedings of the American Statistical Association, 1985.
[15] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. CVPR, 2017.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NIPS, 2014.
[17] Frank R. Hampel, Elvezio M. Ronchetti, Peter J. Rousseeuw, and Werner A. Stahel. Robust Statistics: The Approach Based on Influence Functions. Wiley, 1986.
[18] Trevor Hastie, Robert Tibshirani, and Martin Wainwright. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.
[19] Peter J. Huber. Robust estimation of a location parameter. Annals of Mathematical Statistics, 1964.
[20] Peter J. Huber. Robust Statistics. Wiley, 1981.
[21] John E. Dennis Jr. and Roy E. Welsch. Techniques for nonlinear least squares and robust regression. Communications in Statistics - Simulation and Computation, 1978.
[22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
[23] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.
[24] Philipp Krähenbühl and Vladlen Koltun. Efficient nonlocal regularization for optical flow. ECCV, 2012.
[25] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. ICML, 2016.
[26] Yvan G Leclerc. Constructing simple stable descriptions for image partitioning. IJCV, 1989.
[27] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. ICCV, 2015.
[28] Stéphane Mallat. A theory for multiresolution signal decomposition: The wavelet representation. TPAMI, 1989.
[29] Saralees Nadarajah. A generalized normal distribution. Journal of Applied Statistics, 2005.
[30] Javier Portilla, Vasily Strela, Martin J. Wainwright, and Eero P. Simoncelli. Image denoising using scale mixtures of Gaussians in the wavelet domain. IEEE TIP, 2003.
[31] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
[32] Sohil Atul Shah and Vladlen Koltun. Robust continuous clustering. PNAS, 2017.
[33] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. TPAMI, 2000.
[34] M Th Subbotin. On the law of frequency of error. Matematicheskii Sbornik, 1923.
[35] Deqing Sun, Stefan Roth, and Michael J. Black. Secrets of optical flow estimation and their principles. CVPR, 2010.
[36] Rein van den Boomgaard and Joost van de Weijer. On the equivalence of local-mode finding, robust estimation and mean-shift analysis as used in early vision tasks. ICPR, 2002.
[37] Yi Yang, Dong Xu, Feiping Nie, Shuicheng Yan, and Yueting Zhuang. Image clustering using local discriminant models and global integration. TIP, 2010.
[38] Christopher Zach. Robust bundle adjustment revisited. ECCV, 2014.
[39] Wei Zhang, Deli Zhao, and Xiaogang Wang. Agglomerative clustering via maximum incremental path integral. Pattern Recognition, 2013.
[40] Zhengyou Zhang. Parameter estimation techniques: A tutorial with application to conic fitting, 1995.
[41] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Fast global registration. ECCV, 2016.
[42] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G. Lowe. Unsupervised learning of depth and ego-motion from video. CVPR, 2017.
A. Alternative Forms

Our loss can be written as an outlier process by defining:
\[ \Psi(z, \alpha) = \begin{cases} -\log(z)+z-1 & \text{if } \alpha = 0 \\ z\log(z)-z+1 & \text{if } \alpha = -\infty \\ \frac{|\alpha-2|}{\alpha}\left(\left(1-\tfrac{\alpha}{2}\right)z^{\alpha/(\alpha-2)}+\tfrac{\alpha z}{2}-1\right) & \text{if } \alpha < 2 \end{cases} \]
\[ \rho(x, \alpha, c) = \min_{0 \le z \le 1}\left(\tfrac{1}{2}(x/c)^2 z + \Psi(z, \alpha)\right) \quad (25) \]
Ψ(z, α) is not defined when α ≥ 2 because for those values the loss is no longer robust, and so is not well described as a process that rejects outliers.

We can also derive our loss's weight function to be used during iteratively reweighted least squares [5, 17]:
\[ \frac{1}{x}\frac{\partial \rho}{\partial x}(x, \alpha, c) = \begin{cases} \frac{1}{c^2} & \text{if } \alpha = 2 \\ \frac{2}{x^2+2c^2} & \text{if } \alpha = 0 \\ \frac{1}{c^2}\exp\left(-\tfrac{1}{2}(x/c)^2\right) & \text{if } \alpha = -\infty \\ \frac{1}{c^2}\left(\frac{(x/c)^2}{|\alpha-2|}+1\right)^{\alpha/2-1} & \text{otherwise} \end{cases} \quad (26) \]
Curiously, these IRLS weights resemble a non-normalized form of Student's t-distribution. These weights are not used in any of our experiments, but they are an intuitive way to demonstrate how reducing α attenuates the effect of outliers. A visualization of our loss's Ψ-functions and weight functions for different values of α can be seen in Figure 7.

Figure 7. Our general loss's IRLS weight function (left) and Ψ-function (right) for different values of the shape parameter α.
B. Practical Implementation

The special cases in the definition of ρ(·) that are required because of the removable singularities of f(·) at α = 0 and α = 2 can make implementing our loss somewhat inconvenient. Additionally, f(·) is numerically unstable near these singularities, due to divisions by small values. Furthermore, many deep learning frameworks handle special cases inefficiently by evaluating all cases of a conditional statement, even though only one case is needed. To circumvent these issues we can slightly modify our loss (and its gradient and Ψ-function) to guard against singularities and make implementation easier:
\[ \rho(x, \alpha, c) = \frac{b}{d}\left(\left(\frac{(x/c)^2}{b}+1\right)^{d/2}-1\right) \]
\[ \frac{\partial \rho}{\partial x}(x, \alpha, c) = \frac{x}{c^2}\left(\frac{(x/c)^2}{b}+1\right)^{d/2-1} \]
\[ \Psi(z, \alpha) = \begin{cases} \frac{b}{d}\left(\left(1-\tfrac{d}{2}\right)z^{d/(d-2)}+\tfrac{dz}{2}-1\right) & \text{if } \alpha < 2 \\ 0 & \text{if } \alpha = 2 \end{cases} \]
\[ b = |\alpha-2|+\epsilon \qquad d = \begin{cases} \alpha+\epsilon & \text{if } \alpha \ge 0 \\ \alpha-\epsilon & \text{if } \alpha < 0 \end{cases} \]
where ε is some small value, such as 10^-5. Note that even very small values of ε can cause significant inaccuracy between our true partition function Z(α) and the effective partition function of our approximate distribution when α is near 0, so this approximate implementation should be avoided when accurate values of Z(α) are necessary.
C. Partition Function Approximation

Implementing the negative log-likelihood of our general distribution (i.e., our adaptive loss) requires a tractable and differentiable approximation of its log partition function. Because the analytical form of Z(α) detailed in the paper is difficult to evaluate efficiently for any real number, and especially difficult to differentiate with respect to α, we approximate log(Z(α)) using cubic hermite spline interpolation in a transformed space. Efficiently approximating log(Z(α)) with a spline is difficult, as we would like a concise approximation that holds over the entire valid range α ≥ 0, and we would like to allocate more precision in our spline interpolation to values near α = 2 (which is where log(Z(α)) varies most rapidly). To accomplish this, we first apply a monotonic nonlinearity to α that stretches values near α = 2 (thereby increasing the density of spline knots in this region) and compresses values as α grows past 4, for which we use:
\[ \mathrm{curve}(\alpha) = \begin{cases} \frac{9(\alpha-2)}{4|\alpha-2|+1}+\alpha+2 & \text{if } \alpha < 4 \\ \frac{5}{18}\log(4\alpha-15)+8 & \text{otherwise} \end{cases} \quad (27) \]
Tangents for each knot are set to minimize the squared error between the spline and the true log partition function. Because our spline knots are evenly spaced in this transformed space, spline interpolation can be performed in constant time with respect to the number of spline knots. For all values of α this approximation is accurate to within 10^-6, which appears to be sufficient for our purposes. Our nonlinearity and our spline approximation to the true partition function for small values of α can be seen in Figure 8.

Figure 8. Because our distribution's log partition function log(Z(α)) is difficult to evaluate for arbitrary inputs, we approximate it using cubic hermite spline interpolation in a transformed space: first we curve α by a continuously differentiable nonlinearity that increases knot density near α = 2 and decreases knot density when α > 4 (top) and then we fit an evenly-sampled cubic hermite spline in that curved space (bottom). The dots shown in the bottom plot are a subset of the knots used by our cubic spline, and are presented here to demonstrate how this approach allocates spline knots with respect to α.

D. Motivation and Derivation

Our loss function is derived from the "generalized Charbonnier" loss [35], which itself builds upon the Charbonnier loss function [6]. To better motivate the construction of our loss function, and to clarify its relationship to prior work, here we work through how our loss function was constructed.

Generalized Charbonnier loss can be defined as:
\[ d(x, \alpha, c) = \left(x^2+c^2\right)^{\alpha/2} \quad (28) \]
Here we use a slightly different parametrization from [35] and use α/2 as the exponent instead of just α. This makes the generalized Charbonnier somewhat easier to reason about with respect to standard loss functions: d(x, 2, c) resembles L2 loss, d(x, 1, c) resembles L1 loss, etc.

We can reparametrize generalized Charbonnier loss as:
\[ d(x, \alpha, c) = c^{\alpha}\left((x/c)^2+1\right)^{\alpha/2} \quad (29) \]
We omit the c^α scale factor, which gives us a loss that is scale invariant with respect to c:
\[ g(x, \alpha, c) = \left((x/c)^2+1\right)^{\alpha/2} \quad (30) \]
\[ \forall_{k>0} \;\; g(kx, \alpha, kc) = g(x, \alpha, c) \quad (31) \]
This lets us view the c "padding" variable as a "scale" parameter, similar to other common robust loss functions. Additionally, only after dropping this scale factor does setting α to a negative value yield a family of meaningful robust loss functions, such as Geman-McClure loss.

But this loss function still has several unintuitive properties: the loss is non-zero when x = 0 (assuming a non-zero value of c), and the curvature of the quadratic "bowl" near x = 0 varies as a function of c and α. We therefore construct a shifted and scaled version of Equation 30 that does not have these properties:
\[ \frac{g(x, \alpha, c)-g(0, \alpha, c)}{c^2\, g''(0, \alpha, c)} = \frac{1}{\alpha}\left(\left((x/c)^2+1\right)^{\alpha/2}-1\right) \quad (32) \]
This loss generalizes L2, Cauchy, and Geman-McClure loss, but it has the unfortunate side-effect of flattening out to 0 when α is very negative, thereby prohibiting many annealing strategies. This can be addressed by modifying the 1/α scaling to approach 1 instead of 0 when α is very negative, by introducing another scaling that cancels out the division by α. To preserve the scale-invariance of Equation 31, this scaling also needs to be applied to the (x/c)^2 term in the loss. This scaling also needs to maintain the monotonicity of our loss with respect to α so as to make annealing possible. There are several scalings that satisfy this property, so we select one that is efficient to evaluate and which keeps our loss function smooth (i.e., having derivatives of all orders everywhere) with respect to x, α, and c, which is |α − 2|. This gives us our final loss function:
\[ f(x, \alpha, c) = \frac{|\alpha-2|}{\alpha}\left(\left(\frac{(x/c)^2}{|\alpha-2|}+1\right)^{\alpha/2}-1\right) \quad (33) \]
Using |α − 2| satisfies all of our criteria, though it does introduce a removable singularity into our loss function at α = 2 and reduces numerical stability near α = 2.
E. Additional Properties

Here we enumerate additional properties of our loss function that were not used in our experiments.

At the origin the IRLS weight of our loss is 1/c^2:
\[ \frac{1}{x}\frac{\partial \rho}{\partial x}(0, \alpha, c) = \frac{1}{c^2} \quad (34) \]
For all values of α, when |x| is small with respect to c the loss is well-approximated by a quadratic bowl:
\[ \rho(x, \alpha, c) \approx \tfrac{1}{2}(x/c)^2 \quad \text{if } |x| < c \quad (35) \]
Because the second derivative of the loss is maximized at x = 0, this quadratic approximation tells us that the second derivative is bounded from above:
\[ \frac{\partial^2 \rho}{\partial x^2}(x, \alpha, c) \le \frac{1}{c^2} \quad (36) \]
When α is negative the loss approaches a constant as |x| approaches infinity, letting us bound the loss:
\[ \forall_{x,c} \;\; \rho(x, \alpha, c) \le \frac{\alpha-2}{\alpha} \quad \text{if } \alpha < 0 \quad (37) \]
The loss's Ψ-function increases monotonically with respect to α when α < 2 for all values of z in [0, 1]:
\[ \frac{\partial \Psi}{\partial \alpha}(z, \alpha) \ge 0 \quad \text{if } 0 \le z \le 1 \quad (38) \]
The roots of the second derivative of ρ(x, α, c) are:
\[ x = \pm c\sqrt{\frac{\alpha-2}{\alpha-1}} \quad (39) \]
This tells us at what value of x the loss begins to redescend. This point has a magnitude of c when α = −∞, and that magnitude increases as α increases. The root is undefined when α ≥ 1, as our loss is redescending iff α < 1. Our loss is strictly convex iff α ≥ 1, non-convex iff α < 1, and pseudoconvex for all values of α.

F. Wavelet Implementation

Two of our experiments impose our loss on images reparametrized with the Cohen-Daubechies-Feauveau (CDF) 9/7 wavelet decomposition [8]. The analysis filters used for these experiments are:

lowpass           highpass
 0.852698679009    0.788485616406
 0.377402855613   -0.418092273222
-0.110624404418   -0.040689417609
-0.023849465020    0.064538882629
 0.037828455507

Here the origin coefficient of the filter is listed first, and the rest of the filter is symmetric. The synthesis filters are defined as usual, by reversing the sign of alternating wavelet coefficients in the analysis filters. The lowpass filter sums to √2, which means that image intensities are doubled at each scale of the wavelet decomposition, and that the magnitude of an image is preserved in its wavelet decomposition. Boundary conditions are "reflecting", or half-sample symmetric.
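Since only the non-redundant half of each symmetric filter is listed above, the short sketch below shows one way to expand the listed taps into full analysis filters and to check the √2 normalization of the lowpass filter. The layout (origin tap first, then mirrored) follows the description above; the variable names are our own.

import numpy as np

lowpass_half = [0.852698679009, 0.377402855613, -0.110624404418, -0.023849465020, 0.037828455507]
highpass_half = [0.788485616406, -0.418092273222, -0.040689417609, 0.064538882629]

def symmetric_filter(half):
    """Expand [origin, right side...] into the full symmetric filter."""
    return np.array(half[:0:-1] + list(half))

lowpass = symmetric_filter(lowpass_half)    # 9 taps
highpass = symmetric_filter(highpass_half)  # 7 taps
print(lowpass.sum(), np.sqrt(2.0))          # the lowpass filter sums to sqrt(2), as stated above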
G. Variational Autoencoders

Our VAE experiments were performed using the code included in the TensorFlow Probability codebase at http://github.com/tensorflow/probability/blob/master/tensorflow_probability/examples/vae.py. This code was designed for binarized MNIST data, so adapting it to the real-valued color images in CelebA [27] required the following changes:

• Changing the input and output image resolution from (28, 28, 1) to (64, 64, 3).
• Increasing the number of training steps from 5000 to 50000, as CelebA is significantly larger than MNIST.
• Delaying the start of cosine decay of the learning rate until the final 10000 training iterations.
• Changing the CNN architecture from a 5-layer network with 5-tap and 7-tap filters with interleaved strides of 1 and 2 (which maps from a 28 × 28 image to a vector) to a 6-layer network consisting of all 5-tap filters with strides of 2 (which maps from a 64 × 64 input to a vector). The number of hidden units was left unchanged, and the one extra layer we added at the end of our encoder (and beginning of our decoder) was given the same number of hidden units as the layer before it.
• In our "DCT + YUV" and "Wavelets + YUV" models, before imposing our model's posterior we apply an RGB-to-YUV transformation and then a per-channel DCT or wavelet transformation to the YUV images, and then invert these transformations to visualize each sampled image. In the "Pixels + RGB" model this transformation and its inverse are the identity function.
• As discussed in the paper, for each output coefficient (pixel value, DCT coefficient, or wavelet coefficient) we add a scale variable (σ when using normal distributions, c when using our general distributions) and a shape variable α (when using our general distribution).

We made as few changes to the reference code as possible so as to keep our model architecture as simple as possible, as our goal is not to produce state-of-the-art image synthesis results for some task, but is instead to simply demonstrate the value of our general distribution in isolation.

CelebA [27] images are processed by extracting a square 160 × 160 image region at the center of each 178 × 218 image and downsampling it to 64 × 64 by a factor of 2.5× using TensorFlow's bilinear interpolation implementation. Pixel intensities are scaled to [0, 1].

In the main paper we demonstrated that using our general distribution to independently model the robustness of each coefficient of our image representation works better than assuming a Cauchy (α = 0) or normal distribution (α = 2) for all coefficients (as those two distributions lie within our general distribution). To further demonstrate the value of independently modeling the robustness of each individual coefficient, we ran a more thorough experiment in which we densely sampled values for α in [0, 2] that are used for all coefficients. In Figure 9 we visualize the validation set ELBO for each fixed value of α compared to an independently-adapted model. As we can see, though quality can be improved by selecting a value for α in between 0 and 2, no single global setting of the shape parameter matches the performance achieved by allowing each coefficient's shape parameter to automatically adapt itself to the training data. This observation is consistent with earlier results on adaptive heavy-tailed distributions for image data [30].

Figure 9. Here we compare the validation set ELBO of our adaptive "Wavelets + YUV" VAE model with the ELBO achieved when setting all wavelet coefficients to have the same fixed shape parameter α. We see that allowing our distribution to individually adapt its shape parameter to each coefficient outperforms any single fixed shape parameter.

In our Student's t-distribution experiments, we parametrize each "degrees of freedom" parameter as the exponentiation of some latent free parameter:
\[ \nu^{(i)} = \exp\left(\nu_{\ell}^{(i)}\right) \quad (40) \]
where all ν_ℓ^(i) are initialized to 0. Technically, these experiments are performed with the "Generalized Student's t-distribution", meaning that we have an additional scale parameter σ(i) that is divided into x before computing the log-likelihood and is accounted for in the partition function. These scale parameters are parametrized identically to the c(i) parameters used by our general distribution.

Comparing likelihoods across our different image representations requires that the "Wavelets + YUV" and "DCT + YUV" representations be normalized to match the "Pixels + RGB" representation. We therefore construct the linear transformations used for the "Wavelets + YUV" and "DCT + YUV" spaces to have determinants of 1 as per the change of variable formula (that is, both transformations are in the "special linear group"). Our wavelet construction in Section F satisfies this criterion, and we use the orthonormal version of the DCT which also satisfies this criterion. However, the standard RGB to YUV conversion matrix does not have a determinant of 1, so we scale it by the inverse of the cube root of its determinant, thereby forcing its determinant to be 1. The resulting matrix is:
\[ \begin{pmatrix} 0.47249 & 0.92759 & 0.18015 \\ -0.23252 & -0.45648 & 0.68900 \\ 0.97180 & -0.81376 & -0.15804 \end{pmatrix} \]
Naturally, its inverse maps from YUV to RGB.
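The matrix above can be reproduced (up to rounding) by rescaling a standard RGB-to-YUV matrix so its determinant becomes 1. The sketch below assumes the BT.601 coefficients, which appear to match the listed values; it is a verification aid, not the paper's code.

import numpy as np

rgb_to_yuv = np.array([[ 0.299,    0.587,    0.114  ],
                       [-0.14713, -0.28886,  0.436  ],
                       [ 0.615,   -0.51499, -0.10001]])
normalized = rgb_to_yuv / np.cbrt(np.linalg.det(rgb_to_yuv))  # force the determinant to 1
print(np.round(normalized, 5))        # closely matches the matrix above
print(np.linalg.det(normalized))      # ~1.0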
Because our model can adapt the shape and scale parameters of our general distribution to each output coefficient, after training we can inspect the shapes and scales that have emerged during training, and from them gain insight into how optimization has modeled our training data. In Figures 10 and 11 we visualize the shape and scale parameters for our "Pixels + RGB" and "Wavelets + YUV" VAEs respectively. Our "Pixels" model is easy to visualize as each output coefficient simply corresponds to a pixel in a channel, and our "Wavelets" model can be visualized by flattening each wavelet scale and orientation into an image (our DCT-based model is difficult to visualize in any intuitive way). In both models we observe that training has [...]

Figure 10. The final shape and scale parameters {α(i)} and {c(i)} for our "Pixels + RGB" VAE after training has converged. We visualize α with black = 0 and white = 2 and log(c) with black = log(0.002) and white = log(0.02).

[Figure 11: the final shape parameters {α(i)} and scale parameters {log c(i)} for our "Wavelets + YUV" VAE, shown per channel (Y, U, ...).]

[...] From the high α values in the chroma channel we can infer that chroma variation rarely has outliers. Horizontal luma variation (upper right) tends to have larger α values than vertical luma variation (lower left), perhaps because depth in this dataset is largely due to horizontal motion, and so horizon- [...]

Figure 14. Random samples (more precisely, means of the output distributions decoded from random samples in our latent space) from our family of trained variational autoencoders.

Figure 15. Reconstructions from our family of trained variational autoencoders, in which we use one of three different image representations for modeling images (super-columns) and use either normal, Cauchy, Student's t, or our general distributions for modeling the coefficients of each representation (sub-columns). The leftmost column shows the images which are used as input to each autoencoder. Reconstructions from models using general distributions tend to be sharper and more detailed than reconstructions from the corresponding model that uses normal distributions, particularly for the DCT or wavelet representations, though this difference is less pronounced than what is seen when comparing samples from these models. The DCT and wavelet models trained with Cauchy distributions or Student's t-distributions systematically fail to preserve the background of the input image, as was noted when observing samples from these distributions.

Figure 16. Monocular depth estimation results on the KITTI benchmark using the "Baseline" network of [42] and our own variant in which we replace the network's loss function with our own adaptive loss over wavelet coefficients. Changing only the loss function results in significantly improved depth estimates.

Figure 17. Additional monocular depth estimation results, in the same format as Figure 16.
On the Origin of Deep Learning
Abstract
This paper is a review of the evolutionary history of deep learning models. It covers the genesis of neural networks, when associationist modeling of the brain was first studied, through to the models that have dominated the last decade of deep learning research, such as convolutional neural networks, deep belief networks, and recurrent neural networks. In addition to reviewing these models, this paper primarily focuses on their precedents, examining how the initial ideas were assembled to construct the early models and how these preliminary models developed into their current forms. Many of these evolutionary paths span more than half a century and branch in diverse directions. For example, the CNN is built on prior knowledge of the biological vision system; the DBN evolved from a trade-off between the modeling power and computational complexity of graphical models; and many of today's models are neural counterparts of classical linear models. This paper reviews these evolutionary paths, offers a concise account of how these models were developed, and aims to provide a thorough background for deep learning. More importantly, along the way, this paper summarizes the gist behind these milestones and proposes many directions to guide future research in deep learning.
1. Introduction
Deep learning has dramatically improved the state-of-the-art in many different artificial intelligence tasks, such as object detection, speech recognition, and machine translation (LeCun et al., 2015). Its deep architecture grants deep learning the possibility of solving many more complicated AI tasks (Bengio, 2009). As a result, researchers are extending deep learning to a variety of modern domains and tasks in addition to traditional ones like object detection, face recognition, or language modeling. For example, Osako et al. (2015) use a recurrent neural network to denoise speech signals, Gupta et al. (2015) use stacked autoencoders to discover clustering patterns of gene expressions, Gatys et al. (2015) use a neural model to generate images with different styles, and Wang et al. (2016) use deep learning to perform sentiment analysis from multiple modalities simultaneously. This is the era in which we are witnessing the blooming of deep learning research.
However, to fundamentally push the deep learning research frontier forward, one needs to thoroughly understand what has been attempted in history and why current models exist in their present forms. This paper summarizes the evolutionary history of several different deep learning models and explains the main ideas behind these models and their relationship to their ancestors. Understanding the past work is not trivial, as deep learning has evolved over a long history, as shown in Table 1. Therefore, this paper aims to offer readers a walk-through of the major milestones of deep learning research. We will cover the milestones shown in Table 1, as well as many additional works. We split the story into different sections for clarity of presentation.
This paper starts the discussion from research on modeling the human brain. Although the success of deep learning nowadays is not necessarily due to its resemblance to the human brain (it owes more to its deep architecture), the ambition to build a system that simulates the brain indeed drove the initial development of neural networks. Therefore, the next section begins with connectionism and naturally leads to the age when shallow neural networks matured.

With the maturity of neural networks, this paper continues with a brief discussion of the necessity of extending shallow neural networks into deeper ones, as well as the promises deep neural networks make and the challenges deep architectures introduce.
With the establishment of the deep neural network, this paper branches into three popular deep learning topics. Specifically, in Section 4, this paper elaborates on how Deep Belief Nets and their building block, the Restricted Boltzmann Machine, evolved as a trade-off between modeling power and computational load. In Section 5, this paper focuses on the development history of the Convolutional Neural Network, highlighting the prominent steps along the ladder of the ImageNet competition. In Section 6, this paper discusses the development of Recurrent Neural Networks and their successors, such as LSTMs and attention models, along with the successes they achieved.

While this paper primarily discusses deep learning models, optimization of deep architectures is an unavoidable topic in this field. Section 7 is devoted to a brief summary of optimization techniques, including advanced gradient methods, Dropout, Batch Normalization, etc.
This paper can be read as a complement to (Schmidhuber, 2015). Schmidhuber's paper aims to assign credit to all those who contributed to the present state of the art, so it covers every single incremental work along the path and therefore cannot elaborate well enough on each of them. Our paper, on the other hand, aims to provide the background readers need to understand how these models were developed. Therefore, we emphasize the milestones and elaborate on those ideas to help build associations between them. In addition to the paths of classical deep learning models in (Schmidhuber, 2015), we also discuss recent deep learning work that builds on classical linear models. Another article that can be read as a complement is (Anderson and Rosenfeld, 2000), in which the authors conducted extensive interviews with well-known scientific leaders of the 1990s on the history of neural networks.
2.1 Associationism
“When, therefore, we accomplish an act of reminiscence, we pass through a
certain series of precursive movements, until we arrive at a movement on which
the one we are in quest of is habitually consequent. Hence, too, it is that we
hunt through the mental train, excogitating from the present or some other,
and from similar or contrary or coadjacent. Through this process reminiscence
takes place. For the movements are, in these cases, sometimes at the same time,
sometimes parts of the same whole, so that the subsequent movement is already
more than half accomplished.”
• Similarity: Thought of one event tends to trigger the thought of a similar event.
• Contrast: Thought of one event tends to trigger the thought of an opposite event.
Back then, Aristotle described the implementation of these laws in our mind as common
sense. For example, the feel, the smell, or the taste of an apple should naturally lead to
the concept of an apple, as common sense. Nowadays, it is surprising to see that these
laws proposed more than 2000 years ago still serve as the fundamental assumptions of
machine learning methods. For example, samples that are near each other (under a defined
distance) are clustered into one group; explanatory variables that frequently occur with
response variables draw more attention from the model; similar/dissimilar data are usually
represented with more similar/dissimilar embeddings in latent space.
Contemporaneously, similar laws were also proposed by Zeno of Citium, Epicurus and
St Augustine of Hippo. The theory of associationism was later strengthened by a variety of philosophers and psychologists. Thomas Hobbes (1588-1679) stated that complex experiences were associations of simple experiences, which were in turn associations of sensations. He also believed that associations exist by means of coherence, with frequency as their strength factor. Meanwhile, John Locke (1632-1704) introduced the concept of "association of ideas". He further separated the concept of ideas of sensation from ideas of reflection, and he stated that complex ideas could be derived from a combination of these two kinds of simple ideas. David Hume (1711-1776) later reduced Aristotle's four laws to three: resemblance (similarity), contiguity, and cause and effect. He believed that whatever coherence the world seemed to have was a matter of these three laws. Dugald Stewart (1753-1828) extended these three laws with several other principles, an obvious one among them being accidental coincidence in the sounds of words. Thomas Reid (1710-1796) believed that no original quality of mind, other than habit, was required to explain the spontaneous recurrence of thinking. James Mill (1773-1836) emphasized the law of frequency as the key to learning, which is very similar to later stages of research.
David Hartley (1705-1757), a physician, is widely regarded as the one who made associationism popular (Hartley, 2013). In addition to the existing laws, he proposed the argument that memory could be conceived as smaller-scale vibrations in the same regions of the brain as the original sensory experience. These vibrations can link up to represent complex ideas and therefore act as a material basis for the stream of consciousness. This idea potentially inspired the Hebbian learning rule, which will be discussed later in this paper and which laid the foundation of neural networks.
networks of today: an individual cell is summarizing the stimulation from other selected linked cells within a grouping, as shown in Figure 1. The joint stimulation from a and c triggers X, stimulation from b and c triggers Y, and stimulation from a and c triggers Z. In his original illustration, a, b, c stand for stimulations and X, Y are the outcomes of cells.
With the establishment of how this associative structure of neural grouping can function
as memory, Bain proceeded to describe the construction of these structures. He followed the
directions of associationism and stated that relevant impressions of neural groupings must
be made in temporal contiguity for a period, either on one occasion or repeated occasions.
Further, Bain described the computational properties of neural groupings: connections are strengthened or weakened through experience via changes of the intervening cell-substance. Therefore, the induced circuits would be selected as comparatively strong or weak.
As we will see in the following section, Hebb's postulate highly resembles Bain's description, although nowadays we usually label this postulate as Hebb's, rather than Bain's, according to (Wilkes and Wade, 1997). This omission of Bain's contribution may also be due to Bain's lack of confidence in his own theory: eventually, Bain himself was not convinced and doubted the practical value of neural groupings.
This archaic paragraph can be re-written into the language of modern machine learning as follows:
$$\Delta w_i = \eta\, x_i\, y$$
where $\Delta w_i$ stands for the change of the synaptic weight ($w_i$) of Neuron $i$, whose input signal is $x_i$; $y$ denotes the postsynaptic response and $\eta$ denotes the learning rate. In other words, the Hebbian Learning Rule states that the connection between two units should be strengthened as the frequency of co-occurrence of these two units increases.
Although the Hebbian Learning Rule is seen as laying the foundation of neural networks, its drawbacks are obvious today: as co-occurrences appear more often, the weights of the connections keep increasing, and the weights of a dominant signal will increase exponentially. This is known as the instability of the Hebbian Learning Rule (Principe et al., 1999). Fortunately, these problems did not diminish Hebb's status as the father of neural networks.
where t denotes the iteration. A straightforward way to avoid the explosion of the weights is to apply normalization at the end of each iteration, yielding:
$$w_i^{t+1} = \frac{w_i^t + \eta x_i y}{\left(\sum_{j=1}^{n}\left(w_j^t + \eta x_j y\right)^2\right)^{1/2}}$$
where n denotes the number of neurons. The above equation can be further expanded into
the following form:
$$w_i^{t+1} = \frac{w_i^t}{Z} + \eta\left(\frac{y x_i}{Z} - \frac{\left(\sum_{j=1}^{n} y x_j w_j^t\right) w_i^t}{Z^3}\right) + O(\eta^2)$$
where $Z = \left(\sum_{i=1}^{n} w_i^2\right)^{1/2}$. Further, two more assumptions are introduced: 1) $\eta$ is small, therefore $O(\eta^2)$ is approximately 0; 2) the weights are normalized, therefore $Z = \left(\sum_{i=1}^{n} w_i^2\right)^{1/2} = 1$.
When these two assumptions are introduced back into the previous equation, Oja's rule is obtained as follows:
$$w_i^{t+1} = w_i^t + \eta\, y\,\left(x_i - y\, w_i^t\right)$$
Oja took a step further to show that a neuron updated with this rule effectively performs Principal Component Analysis on the data. To show this, Oja first re-wrote Equation 2 in the following form, with two additional assumptions (Oja, 1982):
$$\frac{\mathrm{d}}{\mathrm{d}t} w_i^t = C w_i^t - \left((w_i^t)^T C w_i^t\right) w_i^t$$
where C is the covariance matrix of the input X. He then proceeded to show this property using several conclusions from another of his works (Oja and Karhunen, 1985), and linked it back to PCA through the fact that the components of PCA are eigenvectors, and that the first component is the eigenvector corresponding to the largest eigenvalue of the covariance matrix. Intuitively, we can interpret this property with a simpler explanation: the eigenvectors of C are the solutions that maximize the rule's update function. Since $w_i^t$ converges to the eigenvectors of the covariance matrix of X, the weights learned by Oja's rule effectively perform PCA.
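To make this concrete, the following is a minimal numerical sketch (not taken from Oja's papers; the synthetic data, learning rate, and number of steps are illustrative assumptions) showing that a single neuron trained with Oja's rule converges, up to sign, to the leading eigenvector of the data covariance matrix, i.e., the first principal component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean data whose first principal component is known in advance.
X = rng.normal(size=(5000, 2)) @ np.array([[2.0, 0.0], [0.0, 0.5]])

eta = 0.01                      # learning rate (eta in the text)
w = rng.normal(size=2)          # weight vector of a single linear neuron
w /= np.linalg.norm(w)

for x in X:
    y = w @ x                   # neuron output y = w^T x
    # Oja's rule: Delta w = eta * y * (x - y * w)
    w += eta * y * (x - y * w)

# The learned weights match the leading eigenvector of the covariance matrix (up to sign).
eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
print("learned w:          ", w / np.linalg.norm(w))
print("leading eigenvector:", eigvecs[:, np.argmax(eigvals)])
```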
Oja’s learning rule concludes our story of learning rules of the early-stage neural network.
Now we proceed to visit the ideas on neural models.
$$y = \begin{cases} 1, & \text{if } \sum_i w_i x_i \ge \theta \ \text{and}\ z_j = 0,\ \forall j \\ 0, & \text{otherwise} \end{cases}$$
where y stands for the output, x_i stands for the input signals, w_i stands for the corresponding weights, and z_j stands for the inhibitory inputs; θ stands for the threshold. The function is designed in such a way that the activity of any inhibitory input completely prevents excitation of the neuron at any time.
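As a minimal sketch of the behavior just described (the function name and the example weights and threshold below are illustrative, not taken from the original formulation): the neuron fires only if the weighted sum of its excitatory inputs reaches the threshold and no inhibitory input is active.

```python
def mcp_neuron(x, w, z, theta):
    """A McCulloch-Pitts style neuron (illustrative sketch).

    x     : excitatory inputs (0 or 1)
    w     : fixed weights, assigned by hand rather than learned
    z     : inhibitory inputs (0 or 1)
    theta : firing threshold
    """
    if any(z):                  # any active inhibitory input blocks the neuron entirely
        return 0
    total = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if total >= theta else 0

# Example: a hand-designed AND gate with two excitatory inputs and one inhibitory input.
print(mcp_neuron([1, 1], [1, 1], [0], theta=2))  # -> 1
print(mcp_neuron([1, 0], [1, 1], [0], theta=2))  # -> 0
print(mcp_neuron([1, 1], [1, 1], [1], theta=2))  # -> 0 (inhibited)
```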
Despite the resemblance between the MCP Neural Model and the modern perceptron, they still differ distinctly in many aspects:
• The MCP Neural Model was initially built as an electrical circuit. Later we will see that the study of neural networks has borrowed many ideas from the field of electrical circuits.
• The weights of the MCP Neural Model, w_i, are fixed, in contrast to the adjustable weights in the modern perceptron. All the weights must be assigned by manual calculation.
• The idea of inhibitory input is quite unconventional even today. It might be an idea worth further study in modern deep learning research.
2.6 Perceptron
With the success of the MCP Neural Model, Frank Rosenblatt further substantiated the Hebbian Learning Rule with the introduction of perceptrons (Rosenblatt, 1958). While theorists like Hebb were focusing on the biological system in the natural environment, Rosenblatt constructed an electronic device named the Perceptron that was shown to have the ability to learn in accordance with associationism.
Rosenblatt (1958) introduced the perceptron in the context of the vision system, as shown in Figure 2(a). He introduced the rules of the organization of a perceptron as follows:
• Stimuli impact on the retina of the sensory units, which respond in such a manner that the pulse amplitude or frequency is proportional to the stimulus intensity.
• Impulses are transmitted to the Projection Area (A_I). This projection area is optional.
Figure 2: (a) Illustration of the organization of a perceptron in (Rosenblatt, 1958). (b) A typical perceptron in modern machine learning literature.
Figure 2(a) illustrates his explanation of the perceptron. From left to right, the four units are the sensory unit, projection unit, association unit and response unit, respectively. The projection unit receives the information from the sensory unit and passes it on to the association unit. This unit is often omitted in other descriptions of similar models. With the omission of the projection unit, the structure resembles the structure of today's perceptron in a neural network (as shown in Figure 2(b)): sensory units collect data, association units linearly add these data with different weights and apply a non-linear transform to the thresholded sum, then pass the results to the response units.
One distinction between the early-stage neuron models and modern perceptrons is the introduction of non-linear activation functions (we use the sigmoid function as an example in Figure 2(b)). This originates from the argument that the linear threshold function should be softened to simulate biological neural networks (Bose et al., 1996), as well as from computational feasibility considerations of replacing the step function with a continuous one (Mitchell et al., 1997).
After Rosenblatt's introduction of the Perceptron, Widrow et al. (1960) introduced a follow-up model called ADALINE. However, the difference between Rosenblatt's Perceptron and ADALINE lies mainly in the learning algorithm. As the primary focus of this paper is neural network models, we skip the discussion of ADALINE.
To show a more concrete example, we introduce a linear perceptron with only two inputs x1 and x2; therefore, the decision boundary w1 x1 + w2 x2 = θ forms a line in a two-dimensional space. The choice of the threshold magnitude shifts the line, and the sign of the function picks one side of the line as the halfspace the function represents. The halfspace is shown in Figure 3 (a).
In Figure 3 (b)-(d), we present two nodes a and b to denote the two inputs, as well as a node to denote the situation when both of them trigger and a node to denote the situation when neither of them triggers. Figure 3 (b) and Figure 3 (c) show clearly that a linear perceptron can be used to describe the AND and OR operations on these two inputs. However, in Figure 3 (d), when we are interested in the XOR operation, the operation can no longer be described by a single linear decision boundary.
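The sketch below makes the point concrete with hand-picked, illustrative weights and thresholds (these particular values are assumptions for demonstration): a single linear perceptron realizes AND and OR, while a brute-force search over a grid of weights and thresholds finds no single linear boundary that realizes XOR.

```python
import itertools

def linear_perceptron(x1, x2, w1, w2, theta):
    # Fires when the weighted sum reaches the threshold.
    return int(w1 * x1 + w2 * x2 >= theta)

inputs = list(itertools.product([0, 1], repeat=2))

# AND: both inputs must be active.
print([linear_perceptron(a, b, 1, 1, theta=2) for a, b in inputs])   # [0, 0, 0, 1]
# OR: either input suffices.
print([linear_perceptron(a, b, 1, 1, theta=1) for a, b in inputs])   # [0, 1, 1, 1]

# XOR: a brute-force search over a grid of weights and thresholds finds no solution.
xor_targets = [a ^ b for a, b in inputs]
grid = [i / 2 for i in range(-8, 9)]      # -4.0 ... 4.0 in steps of 0.5
solvable = any(
    [linear_perceptron(a, b, w1, w2, t) for a, b in inputs] == xor_targets
    for w1 in grid for w2 in grid for t in grid
)
print(solvable)  # False: no single linear boundary implements XOR
```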
In the next section, we will show that the representation ability is greatly enlarged when
we put perceptrons together to make a neural network. However, when we keep stacking
one neural network upon the other to make a deep learning model, the representation power
will not necessarily increase.
• Boolean Approximation: an MLP of one hidden layer1 can represent any boolean
function exactly.
• Continuous Approximation: an MLP of one hidden layer can approximate any bounded
continuous function with arbitrary accuracy.
• Arbitrary Approximation: an MLP of two hidden layers can approximate any function
with arbitrary accuracy.
We will discuss these three properties in detail in the following paragraphs. To suit different
readers’ interest, we will first offer an intuitive explanation of these properties and then offer
the proofs.
1. Throughout this paper, we will follow the most widely accepted naming convention that calls a two-layer neural network a one-hidden-layer neural network.
f(x) is dense in the subspace where it lies. In other words, for an arbitrary function g(x) in the same subspace as f(x), we have
$$|f(x) - g(x)| < \epsilon$$
where ε > 0. In Equation 3, σ denotes the activation function (a squashing function back then), w_i denotes the weights for the input layer and ω_i denotes the weights for the hidden layer.
This conclusion was drawn with a proof by contradiction: with the Hahn-Banach Theorem and the Riesz Representation Theorem, the fact that the closure of f(x) is not the whole subspace where f(x) lies contradicts the assumption that σ is an activation (squashing) function.
To this day, this property has drawn thousands of citations. Unfortunately, many of the later works cite this property inappropriately (Castro et al., 2000), because Equation 3 is not the widely accepted form of a one-hidden-layer neural network: it does not deliver a thresholded/squashed output, but a linear output instead. Ten years after this property was shown, Castro et al. (2000) concluded this story by showing that the universal approximation property still holds when the final output is squashed.
Note that this property was shown under the assumption that the activation functions are squashing functions. By definition, a squashing function σ : R → [0, 1] is a non-decreasing function with the properties $\lim_{x\to\infty}\sigma(x)=1$ and $\lim_{x\to-\infty}\sigma(x)=0$. Many activation functions of recent deep learning research do not fall into this category.
Before we move on to explain this property, we first need to show a major property regarding combining linear perceptrons into neural networks. Figure 5 shows that as the number of linear perceptrons used to bound the target function increases, the area outside the polygon with a sum close to the threshold shrinks. Following this trend, we can use a large number of perceptrons to bound a circle, and this can be achieved even without knowing the threshold, because the area close to the threshold shrinks to nothing. What is left outside the circle is, in fact, the area where the sum is N/2, where N is the number of perceptrons used.
Therefore, a neural network with one hidden layer can represent a circle of arbitrary diameter. Further, we introduce another hidden layer that is used to combine the outputs of many different circles. This newly added hidden layer is only used to perform the OR operation.
Figure 6 shows an example in which, when the extra hidden layer is used to merge the circles from the previous layer, the neural network can be used to approximate any function. The target function is not necessarily continuous. However, each circle requires a large number of neurons, and consequently the entire function requires even more.
This property was shown in (Lapedes and Farber, 1988) and (Cybenko, 1988), respectively. Looking back at this property today, it is not arduous to build connections between this property and Fourier series approximation, which, informally, states that every function curve can be decomposed into the sum of many simpler curves. With this linkage, to show the universal approximation property is to show that a one-hidden-layer neural network can represent one simple surface, and then the second hidden layer sums up these simple surfaces to approximate an arbitrary function.
As we know, one hidden layer of a neural network simply performs a thresholded sum operation; therefore, the only step left is to show that the first hidden layer can represent a simple surface. To understand the "simple surface", with the linkage to the Fourier transform, one can imagine one cycle of a sinusoid for the one-dimensional case or a "bump" of a plane in the two-dimensional case.
Figure 6: How a neural network can be used to approximate a leaf shaped function.
For one dimension, to create a simple surface we only need two sigmoid functions appropriately placed, for example as follows:
$$f_1(x) = \frac{h}{1 + e^{-(x+t_1)}}$$
$$f_2(x) = \frac{h}{1 + e^{x-t_2}}$$
Then, with f_1(x) + f_2(x), we create a simple surface of height 2h approximately on −t_1 ≤ x ≤ t_2. This can easily be generalized to the n-dimensional case, where we need 2n sigmoid functions (neurons) for each simple surface. Then, for each simple surface that contributes to the final function, one neuron is added to the second hidden layer. Therefore, regardless of the number of neurons needed, one will never need a third hidden layer to approximate any function.
Similar to how the Gibbs phenomenon affects Fourier series approximation, this approximation cannot guarantee an exact representation.
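A quick numerical sketch of this construction (the values of h, t1 and t2 are arbitrary illustrative choices): the sum of the two sigmoids stays near 2h on the plateau between the two shifts and falls back towards h far outside it, which is exactly the "simple surface" used as a building block.

```python
import numpy as np

h, t1, t2 = 1.0, 5.0, 5.0
x = np.linspace(-15.0, 15.0, 7)          # a few sample points

f1 = h / (1.0 + np.exp(-(x + t1)))       # close to h once x > -t1
f2 = h / (1.0 + np.exp(x - t2))          # close to h while x < t2
bump = f1 + f2

for xi, bi in zip(x, bump):
    print(f"x = {xi:+6.1f}   f1 + f2 = {bi:.3f}")
# The sum is near 2h between -t1 and t2 and decays towards h outside that interval.
```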
The universal approximation properties show the great potential of shallow neural networks, but at the price of exponentially many neurons in these layers. One follow-up question is how to reduce the number of required neurons while maintaining the representation power. This question motivates people to proceed to deeper neural networks, despite the fact that shallow neural networks already have infinite modeling power. Another issue worth attention is that, although neural networks can approximate any function, it is not trivial to find the set of parameters that explains the data. In the next two sections, we will discuss these two questions respectively.
The universal approximation properties of shallow neural networks come at the price of exponentially many neurons and are therefore not realistic. The question of how to maintain this expressive power of the network while reducing the number of computation units has been asked for years. Intuitively, Bengio and Delalleau (2011) suggested that it is natural to pursue deeper networks because 1) the human neural system is a deep architecture (as we will see with examples in Section 5 about the human visual cortex) and 2) humans tend to represent concepts at one level of abstraction as the composition of concepts at lower levels. Nowadays, the solution is to build deeper architectures, which comes from a conclusion stating that the representation power of a k-layer neural network with polynomially many neurons needs exponentially many neurons to be matched if a (k−1)-layer structure is used. However, theoretically, this conclusion is still being completed.
This conclusion can be traced back three decades, to when Yao (1985) showed the limitations of shallow circuit functions. Hastad (1986) later showed this property with parity circuits: "there are functions computable in polynomial size and depth k but requires exponential size when depth is restricted to k − 1". He showed this property mainly by applying De Morgan's law, which states that any AND of ORs can be rewritten as an OR of ANDs, and vice versa. Thus, he simplified a circuit in which ANDs and ORs appear one after the other by rewriting one layer of ANDs into ORs and then merging this operation into its neighboring layer of ORs. By repeating this procedure, he was able to represent the same function with fewer layers, but more computations.
Moving from circuits to neural networks, Delalleau and Bengio (2011) compared deep and shallow sum-product neural networks. They showed that a function that could be expressed with O(n) neurons on a network of depth k required at least $O(2^{\sqrt{n}})$ and $O((n-1)^k)$ neurons on a two-layer neural network.
Further, Bianchini and Scarselli (2014) extended this study to general neural networks with many major activation functions, including tanh and sigmoid. They derived their conclusion using the concept of Betti numbers, and used this number to describe the representation power of neural networks. They showed that for a shallow network the representation power can only grow polynomially with respect to the number of neurons, but for a deep architecture the representation power can grow exponentially with respect to the number of neurons. They also related their conclusion to the VC-dimension of neural networks, which is $O(p^2)$ for tanh (Bartlett and Maass, 2003), where p is the number of parameters.
Recently, Eldan and Shamir (2015) presented a more thorough proof showing that the depth of a neural network is exponentially more valuable than its width, for a standard MLP with any popular activation function. Their conclusion is drawn with only a few weak assumptions that constrain the activation functions to be mildly increasing, measurable, and able to allow shallow neural networks to approximate any univariate Lipschitz function. Finally, we have a well-grounded theory to support the fact that deeper networks are preferred over shallow ones. However, in reality, many problems arise if we keep increasing the number of layers. Among them, the increased difficulty of learning proper parameters is probably the most prominent one. In the next section, we will discuss the main drive of searching for the parameters of a neural network: backpropagation.
• The number of input neurons cannot be smaller than the number of classes/patterns of data.
However, their approaches may not be relevant anymore, as they require the data to be linearly separable, a condition under which many other models can be applied.
was able to overfit the data. Therefore, this phenomenon should have played a critical role in the research on improving optimization techniques. Recently, the study of the cost surfaces of neural networks has indicated the existence of saddle points (Choromanska et al., 2015; Dauphin et al., 2014; Pascanu et al., 2014), which may explain the findings of Brady et al. back in the late 80s.
Backpropagation enables the optimization of deep neural networks. However, there is still a long way to go before we can optimize them well. Later, in Section 7, we will briefly discuss more techniques related to the optimization of neural networks.
Figure 7: Trade-off between the representation power and computational complexity of several models, which guides the development of better models.
With the background of how the modern neural network is set up, we proceed to visit each prominent branch of the current deep learning family. Our first stop is the branch that leads to the popular Restricted Boltzmann Machines and Deep Belief Nets, and it starts as a model to understand the data in an unsupervised manner.
Figure 7 summarizes the models that will be covered in this section. The horizontal axis stands for the computational complexity of these models, while the vertical axis stands for their representation power. The six milestones that will be focused on in this section are placed in the figure.
represent the data while the lower circles denote the data. There are no connections between the nodes in SOM 2.
The position of each node is fixed. The representation should not be viewed as only a numerical value; instead, its position also matters. This property is different from some widely accepted representation criteria. For example, we compare the case when a one-hot vector and a one-dimensional SOM are used to denote colors: to denote green out of the set C = {green, red, purple}, a one-hot representation can use any of the vectors (1, 0, 0), (0, 1, 0) or (0, 0, 1), as long as we specify the bit for green accordingly. However, for a one-dimensional SOM, only two vectors are possible: (1, 0, 0) or (0, 0, 1). This is because the SOM aims to represent the data while retaining the similarity, and since red and purple are much more similar than green and red or green and purple, green should not be represented in a way that splits red and purple. One should notice that this example is only used to demonstrate that the position of each unit in an SOM matters; in practice, the values of SOM units are not restricted to integers.
The learned SOM is usually a good tool for visualizing data. For example, suppose we conduct a survey on the happiness level and richness level of each country and feed the data into a two-dimensional SOM. Then the trained units should represent the happiest and richest country at one corner and the opposite country at the furthest corner. The other two corners represent the richest-yet-unhappiest and the poorest-but-happiest countries. The remaining countries are positioned accordingly. The advantage of SOM is that it allows one
2. In some other literature, (Bullinaria, 2004) as an example, one may notice that there are connections in the illustrations of the models. However, those connections are only used to represent the neighborhood relationship of the nodes, and there is no information flowing via those connections. In this paper, as we will show many other models that rely on a clear illustration of information flow, we decide to reserve connections for denoting that.
to easily tell how a country ranks in the world with a simple glance at the learned units (Guthikonda, 2005).
With an understanding of the representation power of SOM, we now proceed to its parameter learning algorithm. The classic algorithm is heuristic and intuitive, as shown below:
Here we use a two-dimensional SOM as an example; i, j are the indices of the units; w is the weight of a unit; v denotes a data vector; k is the index of the data; t denotes the current iteration; N constrains the maximum number of steps allowed; P(·) denotes the penalty considering the distance between unit (p, q) and unit (i, j); l is the learning rate; r denotes a radius used to select neighbor nodes. Both l and r typically decrease as t increases. $\|\cdot\|_2^2$ denotes the squared Euclidean distance and dist(·) denotes the distance between the positions of units.
This algorithm explains how SOM can be used to learn a representation and how similarities are retained: it always selects a subset of units that are similar to the sampled data point and adjusts the weights of those units to match the sampled data.
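Since the algorithm box itself is not reproduced above, the following is a minimal sketch in the spirit of that description (the grid size, the decay schedules for l and r, the Gaussian form of the penalty P(·), and the random data are illustrative assumptions rather than the exact choices of the original algorithm).

```python
import numpy as np

rng = np.random.default_rng(0)

grid_h, grid_w, dim = 10, 10, 3          # a 10x10 map of 3-dimensional units
W = rng.random((grid_h, grid_w, dim))    # w: weights of the units
data = rng.random((500, dim))            # v: data vectors (e.g. RGB colours)

# Grid coordinates of every unit, used for the distance on unit positions.
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

N = 2000                                 # maximum number of steps
for t in range(N):
    l = 0.5 * np.exp(-t / N)                           # learning rate, decreasing with t
    r = max(grid_h, grid_w) / 2 * np.exp(-t / N)       # neighbourhood radius, decreasing with t

    v = data[rng.integers(len(data))]                  # sample one data vector

    # Best matching unit: the unit whose weights are closest to v.
    dists = np.sum((W - v) ** 2, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)

    # Penalty P(.) based on the distance between unit positions and the BMU position.
    grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
    P = np.exp(-grid_dist2 / (2 * r ** 2 + 1e-12))

    # Move the BMU and its neighbours towards the sampled data vector.
    W += l * P[..., None] * (v - W)
```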
However, this algorithm relies on a careful selection of the neighborhood radius and a good initialization of the weights. Otherwise, although the learned weights will have the local property of topological similarity, they may lose this property globally: sometimes, two similar clusters of similar events are separated by a dissimilar cluster of similar events. In simpler words, units of green may actually separate units of red and units of purple if the network is not appropriately trained (Germano, 1999).
3. The term "recurrent" is very confusing nowadays because of the popularity that recurrent neural networks (RNNs) have gained.
The magnetic system will evolve until this potential energy reaches its minimum.
where s is the state of a unit, b denotes the bias, w denotes the bidirectional weights, and i, j are the indices of the units. This energy function closely connects to the potential energy function of a spin glass, as shown in Equation 4.
A Hopfield Network is typically applied to memorize the state of data. The weights of the network are designed or learned to make sure that the energy is minimized given the states of interest. Therefore, when another state is presented to the network, while the weights are fixed, the Hopfield Network can search for the states that minimize the energy and recover the state in memory. For example, in a face completion task, when some images of faces are presented to the Hopfield Network (in a way that each unit of the network corresponds to each pixel of one image, and images are presented one after the other), the network can calculate the weights that minimize the energy given these faces. Later, if an image is corrupted or distorted and presented to this network again, the network is able to recover the original image by searching for a configuration of states that minimizes the energy, starting from the corrupted input presented.
The term "energy" may remind people of physics. A physics scenario makes it clearer how a Hopfield Network works: nature uses a Hopfield-Network-like mechanism to "memorize" the equilibrium position of a pendulum because, in the equilibrium position, the pendulum has the lowest gravitational potential energy. Therefore, wherever a pendulum is placed, it will converge back to the equilibrium position.
network will invert the state and proceed to test the next unit. This procedure is called asynchronous update, and it is obviously subject to the sequential order in which units are selected. The counterpart is known as synchronous update, in which the network first tests all the units and then inverts all the units-to-invert simultaneously. Both of these methods may lead to a local optimum. Synchronous update may even result in an increase of energy and may converge to an oscillation or a loop of states.
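A small, self-contained sketch of these ideas (the outer-product learning rule, the 1/2 factor in the energy, and the toy patterns are illustrative choices, not the exact formulation referenced above): a few binary patterns are stored, and a corrupted pattern is recovered by asynchronous updates that drive the energy down.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_hopfield(patterns):
    # Hebbian outer-product rule; states are +1/-1 and self-connections are removed.
    n = patterns.shape[1]
    W = sum(np.outer(p, p) for p in patterns).astype(float)
    np.fill_diagonal(W, 0.0)
    return W / n

def energy(W, s):
    # E = -1/2 * s^T W s  (bias terms omitted in this sketch)
    return -0.5 * s @ W @ s

def recall(W, s, steps=200):
    s = s.copy()
    for _ in range(steps):
        i = rng.integers(len(s))             # asynchronous update: one unit at a time
        s[i] = 1 if W[i] @ s >= 0 else -1
    return s

patterns = np.array([[1, -1, 1, -1, 1, -1, 1, -1],
                     [1, 1, 1, 1, -1, -1, -1, -1]])
W = train_hopfield(patterns)

corrupted = patterns[0].copy()
corrupted[:2] *= -1                          # flip two bits of the stored pattern

recovered = recall(W, corrupted)
print("energy before:", energy(W, corrupted))
print("energy after: ", energy(W, recovered))
print("recovered == stored:", np.array_equal(recovered, patterns[0]))
```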
4.2.4 Capacity
One distinct disadvantage of the Hopfield Network is that it cannot keep the memory very efficiently: a network of N units can only store up to about 0.15N² bits of memory, while a network with N units has N² edges. In addition, after storing M memories (M instances of data), each connection has an integer value in the range [−M, M]. Thus, the number of bits required to store the N units is N² log(2M + 1) (Hopfield, 1982). Therefore, we can safely draw the conclusion that, although the Hopfield Network is a remarkable idea that enables the network to memorize data, it is extremely inefficient in practice.
As follow-ups to the invention of the Hopfield Network, many works attempted to study and increase the capacity of the original Hopfield Network (Storkey, 1997; Liou and Yuan, 1999; Liou and Lin, 2006). Despite these attempts, the Hopfield Network still gradually faded out of the community and was replaced by other models that were inspired by it. Immediately following this section, we will discuss the popular Boltzmann Machine and Restricted Boltzmann Machine and study how these models were upgraded from the initial ideas of the Hopfield Network and evolved to replace it.
$$F(s) \propto e^{-\frac{E_s}{kT}}$$
where s stands for the state and E_s is the corresponding energy; k and T are Boltzmann's constant and the thermodynamic temperature, respectively. Naturally, the ratio of two distributions is characterized only by the difference of their energies, as follows:
$$\frac{F(s_1)}{F(s_2)} = e^{\frac{E_{s_2} - E_{s_1}}{kT}}$$
Figure 10: Illustration of Boltzmann Machine. With the introduction of hidden units
(shaded nodes), the model conceptually splits into two parts: visible units and
hidden units. The red dashed line is used to highlight the conceptual separation.
With the distribution specified by the energy in this way, the probability of each state is defined as its term divided by a normalizer, as follows:
$$p_{s_i} = \frac{e^{-\frac{E_{s_i}}{kT}}}{\sum_j e^{-\frac{E_{s_j}}{kT}}}$$
algorithm called Simulated Annealing (Khachaturyan et al., 1979; Aarts and Korst, 1988) back then, but Simulated Annealing is hardly relevant to today's deep learning community. Regardless of the historical importance that the term T introduces, within this section we will assume T = 1 as a constant, for the sake of simplicity.
where v stands for the visible units and h stands for the hidden units. This equation also connects back to Equation 4, except that the Boltzmann Machine splits the energy function according to hidden units and visible units.
Based on this energy function, the probability of a joint configuration over both the visible units and the hidden units can be defined as follows:
$$p(v, h) = \frac{e^{-E(v,h)}}{\sum_{m,n} e^{-E(m,n)}}$$
The probabilities of the visible/hidden units can be obtained by marginalizing this joint probability. For example, by marginalizing out the hidden units, we can get the probability distribution of the visible units:
$$p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{m,n} e^{-E(m,n)}}$$
function is usually performed to determine the parameters. For simplicity, the following derivation is based on a single observation.
First, we have the log-likelihood function of the visible units as
$$l(v; w) = \log p(v; w) = \log \sum_h e^{-E(v,h)} - \log \sum_{m,n} e^{-E(m,n)}$$
Figure 11: Illustration of the Restricted Boltzmann Machine. With the restriction that there are no connections between hidden units (shaded nodes) and no connections between visible units (unshaded nodes), the Boltzmann Machine turns into a Restricted Boltzmann Machine. The model is now a bipartite graph.
$$\frac{\partial l(v; w)}{\partial w} = -\langle s_i, s_j \rangle_{p_0} + \langle s_i, s_j \rangle_{p_\infty} \qquad (8)$$
where we use p_0 to denote the data distribution and p_∞ to denote the model distribution; other notations remain unchanged. Therefore, the difficulty of the aforementioned methods in learning the parameters is that they require potentially "infinitely" many sampling steps to approximate the model distribution.
Hinton (2002) overcame this issue, almost magically, with the introduction of a method named Contrastive Divergence. Empirically, he found that one does not have to perform "infinitely" many sampling steps to converge to the model distribution; a finite number k of sampling steps is enough. Therefore, Equation 8 is effectively re-written as:
$$\frac{\partial l(v; w)}{\partial w} = -\langle s_i, s_j \rangle_{p_0} + \langle s_i, s_j \rangle_{p_k}$$
Remarkably, Hinton (2002) showed that k = 1 is sufficient for the learning algorithm to
work well in practice.
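The following is a compact sketch of CD-k with k = 1 for a binary RBM (the toy data, layer sizes, learning rate, and the inclusion of bias terms are illustrative assumptions; the gradient above is written for the weights only).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden, lr = 6, 3, 0.1
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v = np.zeros(n_visible)              # visible biases
b_h = np.zeros(n_hidden)               # hidden biases

# Toy binary data: two repeating patterns.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 50, dtype=float)

for epoch in range(200):
    for v0 in data:
        # Positive phase: sample hidden units given the data (<s_i, s_j>_{p_0}).
        p_h0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.random(n_hidden) < p_h0).astype(float)

        # One step of Gibbs sampling (k = 1): reconstruct v, then recompute h.
        p_v1 = sigmoid(h0 @ W.T + b_v)
        v1 = (rng.random(n_visible) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + b_h)

        # Contrastive Divergence update: data correlations minus k-step correlations.
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_v += lr * (v0 - v1)
        b_h += lr * (p_h0 - p_h1)

# The reconstruction of a training pattern should now resemble the pattern itself.
v = data[0]
p_h = sigmoid(v @ W + b_h)
print(np.round(sigmoid(p_h @ W.T + b_v), 2))
```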
Carreira-Perpinan and Hinton (2005) attempted to justify Contrastive Divergence in theory, but their derivation led to the negative conclusion that Contrastive Divergence is a biased algorithm, and that a finite k cannot represent the model distribution. However, their empirical results suggested that a finite k can approximate the model distribution well enough, resulting in a small enough bias. In addition, the algorithm works well in practice, which strengthened the idea of Contrastive Divergence.
Figure 12: Illustration of Deep Belief Networks. Deep Belief Networks are not just RBMs stacked together. The bottom layers (all layers except the top one) do not have bi-directional connections, but only top-down connections.
With the reasonable modeling power and a fast approximation algorithm, RBM quickly
draws great attention and becomes one of the most fundamental building blocks of deep
neural networks. In the following two sections, we will introduce two distinguished deep
neural networks that are built based on RBM/Boltzmann Machine, namely Deep Belief
Nets and Deep Boltzmann Machine.
8. This paper is generally seen as the opening of today's Deep Learning era, as it first introduced the possibility of training a deep neural network by layer-wise training.
• Bengio et al. (2007) suggested that unsupervised pre-training initializes the model to a point in parameter space that leads to a more effective optimization process, so that the optimization can find a lower minimum of the empirical cost function.
• Erhan et al. (2010) empirically argued for a regularization explanation, that unsu-
pervised pretraining guides the learning towards basins of attraction of minima that
support better generalization from the training data set.
In addition to Deep Belief Networks, this pretraining mechanism also inspires the pre-
training for many other classical models, including the autoencoders (Poultney et al., 2006;
Bengio et al., 2007), Deep Boltzmann Machines (Salakhutdinov and Hinton, 2009) and some
models inspired by these classical models like (Yu et al., 2010).
After pre-training is performed, fine-tuning is carried out to further optimize the network, searching for parameters that lead to a lower minimum. For Deep Belief Networks, there are two different fine-tuning strategies, depending on the goal of the network.
Fine Tuning for Generative Model Fine-tuning for a generative model is achieved with a contrastive version of the wake-sleep algorithm (Hinton et al., 1995). This algorithm is intriguing because it is designed to interpret how the brain works. Scientists have found that sleeping is a critical process of brain function, and it seems to be an inverse version of how we learn when we are awake. The wake-sleep algorithm likewise has two phases. In the wake phase, we propagate information bottom-up to adjust the top-down weights for reconstructing the layer below. The sleep phase is the inverse of the wake phase: we propagate the information top-down to adjust the bottom-up weights for reconstructing the layer above.
The contrastive version of this wake-sleep algorithm adds a Contrastive Divergence phase between the wake phase and the sleep phase. The wake phase only goes up to the visible layer of the top RBM, then we sample the top RBM with Contrastive Divergence, and then a sleep phase starts from the visible layer of the top RBM.
Fine Tuning for Discriminative Model The strategy for fine-tuning a DBN as a discriminative model is to simply apply standard backpropagation to the pre-trained model, since we have labels for the data. However, pre-training is still necessary in spite of the generally good performance of backpropagation.
Figure 13: Illustration of the Deep Boltzmann Machine. The Deep Boltzmann Machine is more like stacking RBMs together: connections between every two layers are bidirectional.
$$E(v, h) = -\sum_i v_i b_i - \sum_{n=1}^{N}\sum_k h_{n,k} b_{n,k} - \sum_{i,k} v_i w_{ik} h_k - \sum_{n=1}^{N-1}\sum_{k,l} h_{n,k} w_{n,k,l} h_{n+1,l}$$
4.6.1 Deep Boltzmann Machine (DBM) vs. Deep Belief Networks (DBN)
As their acronyms suggest, the Deep Boltzmann Machine and Deep Belief Networks have many similarities, especially at first glance. Both of them are deep neural networks originating from the idea of the Restricted Boltzmann Machine. (The name "Deep Belief Network" seems to indicate that it also partially originates from the Bayesian Network (Krieg, 2001).) Both of them also rely on layer-wise pre-training for successful parameter learning.
However, the fundamental differences between these two models are dramatic, introduced by how the connections are made between the bottom layers (undirected/bidirectional vs. directed). The bidirectional structure of the DBM grants it the possibility to learn more complex patterns of data. It also grants the approximate inference procedure the possibility to incorporate top-down feedback in addition to an initial bottom-up pass, allowing Deep Boltzmann Machines to better propagate uncertainty about ambiguous inputs.
• The retina converts the light energy that comes from the rays bouncing off an object into chemical energy. This chemical energy is then converted into action potentials that are transferred to the primary visual cortex. (In fact, there are several other brain structures involved between the retina and V1, but we omit these structures for simplicity9.)
9. We deliberately discuss the components that have connections with the established technologies of convolutional neural networks; one who is interested in developing more powerful models is encouraged to investigate the other components.
Figure 14: A brief illustration of ventral stream of the visual cortex in human vision system.
It consists of primary visual cortex (V1), visual areas (V2 and V4) and inferior
temporal gyrus.
• The primary visual cortex (V1) mainly fulfills the task of edge detection, where an edge is an area with the strongest local contrast in the visual signals.
• V2, also known as secondary visual cortex, is the first region within the visual as-
sociation area. It receives strong feedforward connections from V1 and sends strong
connections to later areas. In V2, cells are tuned to extract mainly simple properties
of the visual signals such as orientation, spatial frequency, and colour, and a few more
complex properties.
• The inferior temporal gyrus (IT) is responsible for identifying an object based on its color and form and comparing that processed information to stored memories of objects to identify the object (Kolb et al., 2014). In other words, IT performs semantic-level tasks, like face recognition.
Many of the descriptions of the functions of the visual cortex should remind readers who have been exposed to the relevant technical literature of convolutional neural networks. Later in this section, we will discuss more details of convolutional neural networks, which will help build explicit connections. Even for readers who barely have knowledge of convolutional neural networks, this hierarchical structure of the visual cortex should immediately ring a bell about neural networks.
Besides convolutional neural networks, the visual cortex has been inspiring work in computer vision for a long time. For example, Li (1998) built a neural model inspired by the primary visual cortex (V1). At another granularity, Serre et al. (2005) introduced a system with feature detections inspired by the visual cortex. De Ladurantaye et al. (2012) published a book describing models of information processing in the visual cortex. Poggio and Serre (2013) conducted a more comprehensive survey on the relevant topic, but they did not focus on any particular subject in detail in their survey. In this section, we discuss the connections between the visual cortex and convolutional neural networks in detail. We will begin with the Neocognitron, which borrows some ideas from the visual cortex and later inspired the convolutional neural network.
The Neocognitron, proposed by Fukushima (1980), is generally seen as the model that inspired Convolutional Neural Networks on the computational side. It is a neural network that consists of two different kinds of layers (S-layers as feature extractors and C-layers as structured connections that organize the extracted features).
An S-layer consists of a number of S-cells that are inspired by the cells in the primary visual cortex. It serves as a feature extractor. Each S-cell can ideally be trained to be responsive to a particular feature presented in its receptive field. Generally, local features such as edges in particular orientations are extracted in lower layers, while global features are extracted in higher layers. This structure highly resembles how humans conceive objects. A C-layer resembles the complex cells in the higher pathway of the visual cortex. It is mainly introduced to make the features extracted by the S-layer shift invariant.
During the parameter learning process, only the parameters of the S-layers are updated. The Neocognitron can also be trained in an unsupervised manner, to obtain a good feature extractor out of the S-layers. The training process for an S-layer is very similar to the Hebbian Learning rule: it strengthens the connections between the S-layer and the C-layer for whichever S-cell shows the strongest response. This training mechanism inherits the problem of the Hebbian Learning rule, namely that the strength of the connections will saturate (since it keeps increasing). A solution was also proposed by Fukushima (1980) under the name "inhibitory cell", which performs a normalization to avoid the problem.
Now we proceed from the Neocognitron to the Convolutional Neural Network. First, we will introduce the building components: the convolutional layer and the subsampling layer. Then we assemble these components to present the Convolutional Neural Network, using LeNet as an example.
and denoted as h = f ∗ g.
Convolutional neural networks typically work with the two-dimensional convolution operation, which is summarized in Figure 15.
As shown in Figure 15, the leftmost matrix is the input matrix. The middle one is usually called a kernel matrix. Convolution is applied to these matrices and the result is shown as the rightmost matrix. The convolution process is an element-wise product followed by a sum, as shown in the example. When the upper-left 3 × 3 matrix is convolved with the kernel, the result is 29. Then we slide the target 3 × 3 matrix one column to the right, convolve it with the kernel, and get the result 12. We keep sliding and record the results as a matrix. Because the kernel is 3 × 3, every target matrix is 3 × 3; thus, every 3 × 3 matrix is convolved into one number and the whole 5 × 5 matrix is shrunk into a 3 × 3 matrix (because 5 − (3 − 1) = 3, where the 3 in parentheses is the size of the kernel matrix).
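A minimal sketch of the valid two-dimensional convolution described above (the input and kernel values are arbitrary illustrations rather than the ones in Figure 15; as is customary for CNNs, no kernel flipping is performed): a 5 × 5 input convolved with a 3 × 3 kernel yields a 3 × 3 output, since 5 − (3 − 1) = 3.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image, taking an element-wise product and sum
    at every position (no kernel flipping, as is customary for CNNs)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # a 5x5 input
kernel = np.full((3, 3), 1.0 / 9.0)                # a 3x3 averaging kernel

result = conv2d_valid(image, kernel)
print(result.shape)   # (3, 3): the 5x5 input shrinks to 3x3
print(result)         # each entry is the mean of a 3x3 patch
```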
One should realize that this convolution is locally shift invariant: for many different combinations of how the nine numbers in the upper 3 × 3 matrix are placed, the convolved result will still be 29. This invariance plays a critical role in vision problems because, in an ideal case, the recognition result should not change due to shifts or rotations of features. This property used to be handled elegantly by Lowe (1999) and Bay et al. (2006), but convolutional neural networks brought the performance up to a new level.
Figure 16: Convolutional kernel examples. Different kernels applied to the same image result in differently processed images. Note that a 1/9 divisor is applied to these kernels.
Figure 17: An illustration of LeNet, where Conv stands for convolutional layer and Sam-
pling stands for SubSampling Layer.
many well-trained popular models can usually perform well on other tasks with only a limited fine-tuning process: the kernels have been well trained and are universally applicable.
With an understanding of the essential role the convolution operation plays in vision tasks, we proceed to investigate some major milestones along the way.
f (x) = max(0, x)
which is a transform that removes the negative part of the input, resulting in a clearer contrast of the meaningful features as opposed to the other side products the kernel produces. Therefore, this non-linearity grants the convolution more power in extracting useful features and allows it to simulate the functions of the visual cortex more closely.
5.4.3 LeNet
With the two most important components introduced, we can stack them together to as-
semble a convolutional neural network. Following the recipe of Figure 17, we will end up
with the famous LeNet.
LeNet is known for its ability to classify digits and can handle a variety of different problems with digits, including variances in position and scale, rotation and squeezing of digits, and even different stroke widths. Meanwhile, with the introduction of LeNet, LeCun et al. (1998b) also introduced the MNIST database, which later became the standard benchmark in the digit recognition field.
5.5.1 AlexNet
While LeNet is the model that started the era of convolutional neural networks, AlexNet, invented by Krizhevsky et al. (2012), is the one that started the era of CNNs used for ImageNet classification. AlexNet is the first evidence that CNNs can perform well on the historically difficult ImageNet dataset, and it performed so well that it led the community into a competition of developing CNNs.
The success of AlexNet is not only due to its unique design of architecture but also due to its clever training mechanisms. To cope with the computationally expensive training process, AlexNet was split into two streams and trained on two GPUs. It also used data augmentation techniques consisting of image translations, horizontal reflections, and patch extractions.
The recipe of AlexNet is shown in Figure 18. However, hardly any lessons can be learned from the architecture of AlexNet despite its remarkable performance. Even more unfortunately, the fact that this particular architecture does not have well-grounded theoretical support pushed many researchers to blindly burn computing resources testing new architectures. Many models were introduced during this period, but only a few may be worth mentioning in the future.
5.5.2 VGG
In the blind competition of exploring different architectures, Simonyan and Zisserman (2014) showed that simplicity is a promising direction with a model named VGG. Although VGG is deeper (19 layers) than other models of that time, the architecture is extremely simple: all the layers are 3 × 3 convolutional layers, with 2 × 2 pooling layers. This simple usage of convolutional layers simulates a larger filter while keeping the benefits of smaller filter sizes, because the combination of two 3 × 3 convolutional layers has an effective receptive field of a 5 × 5 convolutional layer, but with fewer parameters.
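A quick back-of-the-envelope check of this claim (assuming, for illustration, C input channels, C output channels, and no bias terms): two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field with 18C² parameters instead of 25C².

```python
def conv_params(kernel_size, channels):
    # Parameters of one convolutional layer with `channels` inputs and outputs (no bias).
    return kernel_size * kernel_size * channels * channels

C = 64
two_3x3 = 2 * conv_params(3, C)   # two stacked 3x3 layers
one_5x5 = conv_params(5, C)       # one 5x5 layer with the same receptive field

print(two_3x3, one_5x5)           # 73728 vs 102400
print(two_3x3 / one_5x5)          # 0.72: fewer parameters for the same receptive field
```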
The spatial size of the input volumes at each layer decreases as a result of the convolutional and pooling layers, but the depth of the volumes increases because of the increased number of filters (in VGG, the number of filters doubles after each pooling layer). This behavior reinforces the idea of VGG: shrink the spatial dimensions, but grow the depth.
VGG was not the winner of the ImageNet competition of that year (the winner was GoogLeNet, invented by Szegedy et al. (2015)). GoogLeNet introduced several important concepts, like the Inception module and concepts later used by R-CNN (Girshick et al., 2014; Girshick, 2015; Ren et al., 2015), but its arbitrary/creative design of architecture barely contributed more to the community than what VGG did, especially considering that Residual Net, following the path of VGG, won the ImageNet challenge at an unprecedented level.
increasing the number of layers will only result in worse results, for both training cases and testing cases (He et al., 2015).
The breakthrough ResNet introduced, which allows ResNet to be substantially deeper than previous networks, is called the Residual Block. The idea behind a Residual Block is that some input of a certain layer (denoted as x) can be passed to the component two layers later, either following the traditional path, which involves a succession of convolutional layers and ReLU transforms (we denote the result as f(x)), or going through an expressway that directly passes x there. As a result, the input to the component two layers later is f(x) + x instead of what would typically be f(x). The idea of the Residual Block is illustrated in Figure 19.
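A minimal functional sketch of this idea (the two-layer transform f below is a stand-in built from dense layers to keep the example short; it is not the exact convolutional stack used in ResNet):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W1, W2):
    """Return f(x) + x, where f is a small two-layer transform of the input.
    In ResNet, f would be a pair of convolutional layers; dense layers are used
    here only to keep the sketch short."""
    f_x = W2 @ relu(W1 @ x)    # the "traditional path"
    return relu(f_x + x)       # the identity shortcut is added before the non-linearity

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W1, W2 = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1

out = residual_block(x, W1, W2)
print(out.shape)   # (8,): same dimensionality, so blocks can be stacked deeply
```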
In a complementary work, He et al. (2016) validated that residual blocks are essential for propagating information smoothly and therefore simplify the optimization. They also extended ResNet to a 1000-layer version with success on the CIFAR data set.
Another interesting perspective on ResNet is provided by Veit et al. (2016). They showed that ResNet behaves like an ensemble of shallow networks: the expressway introduced allows ResNet to perform as a collection of independent networks, each of which is significantly shallower than the integrated ResNet itself. This also explains why the gradient can be passed through the ultra-deep architecture without vanishing. (We will talk more about the vanishing gradient problem when we discuss recurrent neural networks in the next section.) Another work, which is not directly relevant to ResNet but may help in understanding it, was conducted by Hariharan et al. (2015). They showed that features from lower layers are informative in addition to what can be summarized from the final layer.
ResNet is still not completely free of clever design choices. The number of layers in the whole network and the number of layers that a Residual Block allows the identity to bypass are still choices that require experimental validation. Nonetheless, to some extent, ResNet has shown that critical reasoning can help the development of CNNs better than blind experimental trials. In addition, the idea of the Residual Block has been found in the actual visual cortex (in the ventral stream of the visual cortex, V4 can directly accept signals from the primary visual cortex), although ResNet was not designed according to this in the first place.
With the introduction of these state-of-the-art neural models that are successful in these
challenges, Canziani et al. (2016) conducted a comprehensive experimental study comparing
these models. Upon comparison, they showed that there is still room for improvement on
fully connected layers that show strong inefficiencies for smaller batches of images.
Figure 20: Illustrations of some mistakes of neural networks. (a)-(d) (from (Szegedy et al., 2013)) are adversarial images that are generated based on original images. The differences between these and the original ones are unobservable to the naked eye, but the neural network can successfully classify the original ones and yet fail on the adversarial ones. (e)-(h) (from (Nguyen et al., 2015)) are generated patterns. A neural network classifies them as (e) school bus, (f) guitar, (g) peacock and (h) Pekinese, respectively.
misclassification with high confidence. The null space works like a blind spot of a matrix: changes within the null space are never perceptible to the corresponding matrix.
This blind spot should not discourage the promising future of neural networks. On the contrary, it makes the convolutional neural network resemble the human vision system at a deeper level. In the human vision system, blind spots (Gregory and Cavanagh, 2011) also exist (Wandell, 1995). Interesting work might emerge that links the flaws of the human vision system to the defects of neural networks and helps to overcome these defects in the future.
Figure 21: Some failed images of ImageNet classification by ResNet and the primary label
associated with the image.
To further improve the performance ResNet reached, one direction might be to model the annotators' labeling preferences. One assumption could be that annotators prefer to label an image in a way that makes it distinguishable. Some established work on modeling human factors (Wilson et al., 2015) could be helpful.
However, the more important question is whether it is worth optimizing the model to increase the testing results on the ImageNet dataset, since the remaining misclassifications may not be a result of the incompetence of the model, but of problems with the annotations.
The introduction of other data sets, like COCO (Lin et al., 2014), Flickr (Plummer et al.,
2015), and VisualGenome (Krishna et al., 2016) may open a new era of vision problems with
more competitive challenges. However, the fundamental problems and experiences that this
section introduces should never be forgotten.
Figure 22: The difference in recurrent structure between the Jordan Network and the Elman Network.
Figure 23: The unfolded structured of BRNN. The temporal order is from left to right.
Hidden layer 1 is unfolded in the standard way of an RNN. Hidden layer 2 is
unfolded to simulate the reverse connection.
The Bidirectional Recurrent Neural Network (BRNN) was invented by Schuster and Paliwal (1997) with the goal of introducing a structure that unfolds into a bidirectional neural network. Therefore, when it is applied to time series data, not only can information be passed following the natural temporal order, but later information can also provide knowledge to previous time steps in the reverse direction.
Figure 23 shows the unfolded structure of a BRNN. Hidden layer 1 is unfolded in the standard way of an RNN. Hidden layer 2 is unfolded to simulate the reverse connection. Transparency (in Figure 23) is applied to emphasize that unfolding an RNN is only a concept used for illustration purposes. The actual model handles data from different time steps with the same single model.
The BRNN is formulated as follows:
where the subscripts 1 and 2 denote the variables associated with hidden layers 1 and 2, respectively.
With the introduction of "recurrent" connections back from the future, Backpropagation Through Time is no longer directly feasible. The solution is to treat this model as a combination of two RNNs, a standard one and a reverse one, and then apply BPTT to each of them. The weights are updated simultaneously once the two gradients are computed.
that is designed to overcome the vanishing gradient problem with the help of a specially designed memory cell. Nowadays, "LSTM" is widely used to denote any recurrent network with that memory cell, which is now referred to as an LSTM cell.
LSTM was introduced to overcome the problem that RNNs cannot capture long-term dependencies (Bengio et al., 1994). Overcoming this issue requires the specially designed memory cell, as illustrated in Figure 24 (a).
LSTM consists of several critical components.
• states: values that are used to offer the information for the output.
• gates: values that are used to decide the information flow between states.
  – input gate: it decides whether the input state enters the internal state. It is denoted as g, and we have:
    $$g^t = \sigma(W_{gi}\, i^t) \qquad (10)$$
  – forget gate: it decides whether the internal state forgets the previous internal state. It is denoted as f, and we have:
    $$f^t = \sigma(W_{fi}\, i^t) \qquad (11)$$
  – output gate: it decides whether the internal state passes its value to the output and the hidden state of the next time step. It is denoted as o, and we have:
    $$o^t = \sigma(W_{oi}\, i^t) \qquad (12)$$
Finally, considering how the gates decide the information flow between states, we have the last two equations to complete the formulation of the LSTM:
$$m^t = g^t \odot i^t + f^t \odot m^{t-1} \qquad (13)$$
$$h^t = o^t \odot m^t \qquad (14)$$
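Equations 10-14 translate almost directly into code. The sketch below assumes, as in Figure 24(b), that the input state i^t is a single vector that already combines the current input and the previous hidden state, and that all weight matrices are square so the element-wise products are well defined; these are assumptions of the sketch, not details given in the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(i_t, m_prev, W_gi, W_fi, W_oi):
    """One step of the simplified LSTM cell described by Equations 10-14."""
    g_t = sigmoid(W_gi @ i_t)            # input gate   (Eq. 10)
    f_t = sigmoid(W_fi @ i_t)            # forget gate  (Eq. 11)
    o_t = sigmoid(W_oi @ i_t)            # output gate  (Eq. 12)
    m_t = g_t * i_t + f_t * m_prev       # internal state update (Eq. 13)
    h_t = o_t * m_t                      # output / hidden state (Eq. 14)
    return m_t, h_t

rng = np.random.default_rng(0)
d = 4
W_gi, W_fi, W_oi = (rng.normal(size=(d, d)) for _ in range(3))

m = np.zeros(d)                          # initial internal state
for t in range(5):                       # run the cell over a short input sequence
    i_t = rng.normal(size=d)             # input state at time t
    m, h = lstm_cell(i_t, m, W_gi, W_fi, W_oi)
    print(t, np.round(h, 3))
```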
Figure 24: (a) The LSTM "memory" cell. (b) Input data and the previous hidden state form the input state. (c) Calculating the input gate and forget gate. (d) Calculating the output gate. (e) Updating the internal state. (f) Output and update of the hidden state.
Figure 24 (c) shows how the input gate and forget gate are computed, as described in Equations 10 and 11. Figure 24 (d) shows how the output gate is computed, as described in Equation 12. Figure 24 (e) shows how the internal state is updated, as described in Equation 13. Figure 24 (f) shows how the output and hidden state are updated, as described in Equation 14.
All the weights are parameters that need to be learned during training. Therefore, in theory, an LSTM can learn to memorize long-term dependencies when necessary and learn to forget the past when necessary, making it a powerful model.
With this important theoretical guarantee, many works have attempted to improve LSTM. For example, Gers and Schmidhuber (2000) added a peephole connection that allows the gates to use information from the internal state. Cho et al. (2014) introduced the Gated Recurrent Unit, known as GRU, which simplifies LSTM by merging the internal state and hidden state into one state and merging the forget gate and input gate into a single update gate. Integrating the LSTM cell into a bidirectional RNN is also an intuitive follow-up (Graves et al., 2013).
Interestingly, despite the novel LSTM variants proposed now and then, Greff et al. (2015) conducted a large-scale experiment investigating the performance of LSTMs and concluded that none of the variants improves upon the standard LSTM architecture significantly. The improvement of LSTM probably lies in a direction other than updating the structure inside the cell; attention models seem to be such a direction.
Attention models are loosely based on a bionic design that simulates the human visual attention mechanism: when humans look at an image, we neither scan it bit by bit nor stare at the whole image; instead, we focus on some major part of it and gradually build the context after capturing the gist. Attention mechanisms were first discussed by Larochelle and Hinton (2010) and Denil et al. (2012). Attention models mostly refer to the models introduced in (Bahdanau et al., 2014) for machine translation, which were soon applied to many different domains, such as (Chorowski et al., 2015) for speech recognition and (Xu et al., 2015) for image caption generation.
Attention models are mostly used for sequence output prediction. Instead of seeing the whole sequential input and making one single prediction (as in, for example, a language model), the model needs to make a sequential prediction from the sequential input for tasks like machine translation or image caption generation. Therefore, the attention model is mostly used to answer the question of where to pay attention based on previously predicted labels or hidden states.
The output sequence does not have to be linked one-to-one to the input sequence, and the input data may not even be a sequence. Therefore, an encoder-decoder framework (Cho et al., 2015) is usually necessary. The encoder is used to encode the data into representations, and the decoder is used to make sequential predictions. The attention mechanism is used to locate a region of the representation for predicting the label at the current time step.
Figure 25 shows a basic attention model under the encoder-decoder network structure. The representation the encoder produces is fully accessible to the attention model, and the attention model only selects some regions to pass on to the LSTM cell for use in making predictions.
Figure 25: The unfolded structure of an attention model. Transparency is used to show that unfolding is only conceptual. The representations the encoder learns are all available to the decoder across all time steps. The attention module only selects some of them to pass on to the LSTM cell for prediction.
Therefore, all the magic of attention models lies in how the attention module in Figure 25 helps to localize the informative representations.
To formalize how it works, we use r to denote the encoded representation (there are a total of M regions of representation) and h to denote the hidden states of the LSTM cell. Then, the attention module can generate the unscaled weight for the i-th region of the encoded representation as:

        β_i^t = f(h^{t−1}, r, {α_j^{t−1}}_{j=1}^{M})

where the α_j^{t−1} are the attention weights computed at the previous time step; at the current time step they are computed with a simple softmax function:

        α_i^t = exp(β_i^t) / Σ_{j=1}^{M} exp(β_j^t)

Therefore, we can further use the weights α to reweight the representation r for prediction.
• Soft attention: The result is a simple weighted sum of the context vectors such that:
M
X
t
r = αjt cj
j
• Hard attention: The model is forced to make a hard decision by only localizing one
region: sampling one region out following multinoulli distribution.
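To make the soft-attention computation concrete, below is a minimal NumPy sketch of one attention step: a scoring function produces the unscaled weights β from the previous hidden state and each region, a softmax turns them into α, and the context is a weighted sum of the region vectors. The additive scoring function and the parameter names (W_h, W_r, v) are illustrative assumptions, not the exact parameterization of any particular paper.

```python
import numpy as np

def soft_attention_step(h_prev, regions, W_h, W_r, v):
    """One soft-attention step.

    h_prev  : previous decoder hidden state, shape (d_h,)
    regions : encoded representation, shape (M, d_r) -- one row per region
    W_h, W_r, v : parameters of an (assumed) additive scoring function
    """
    # Unscaled weights beta_i^t for each region.
    scores = np.tanh(regions @ W_r.T + W_h @ h_prev) @ v   # shape (M,)
    # Softmax over regions gives the attention weights alpha_i^t.
    scores = scores - scores.max()                          # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    # Soft attention: weighted sum of the region (context) vectors.
    context = alpha @ regions                               # shape (d_r,)
    return alpha, context
```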
Figure 26: (a) Deep input architecture; (b) deep recurrent architecture; (c) deep output architecture.
One problem with hard attention is that sampling from a multinoulli distribution is not differentiable, so gradient-based methods can hardly be applied directly. Variational methods (Ba et al., 2014) or policy-gradient-based methods (Sutton et al., 1999) can be considered instead.
In this very last section of the evolutionary path of the RNN family, we visit some ideas that have not been fully explored. Although recurrent neural networks suffer from many of the issues that deep neural networks have because of their recurrent connections, current RNNs are still not deep models in terms of representation learning compared with models from other families.
Pascanu et al. (2013a) formalize the idea of constructing deep RNNs by extending current RNNs. Figure 26 shows three different directions for constructing a deep recurrent neural network: increasing the layers of the input component (Figure 26 (a)), the recurrent component (Figure 26 (b)), and the output component (Figure 26 (c)), respectively.
θ^{t+1} = θ^t − η ∇_θ^t
7.1.1 Rprop
Rprop was introduced by Riedmiller and Braun (1993). It is a unique method, still studied today, because it does not fully use the gradient information but only considers its sign. In other words, it updates the parameters using only the sign of the gradient.
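As a sketch of the standard sign-based rule (the per-parameter step size Δ_i and the factors η⁺ > 1 > η⁻ follow the original formulation; the simplified form below omits the weight-backtracking variant):

        Δ_i^t = η⁺ Δ_i^{t−1}    if ∇_i^t · ∇_i^{t−1} > 0
        Δ_i^t = η⁻ Δ_i^{t−1}    if ∇_i^t · ∇_i^{t−1} < 0
        Δ_i^t = Δ_i^{t−1}       otherwise

        θ_i^{t+1} = θ_i^t − sign(∇_i^t) Δ_i^t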
7.1.2 AdaGrad
AdaGrad was introduced by Duchi et al. (2011). It follows the idea of an adaptive learning-rate mechanism that assigns a higher learning rate to parameters that have been updated mildly and a lower learning rate to parameters that have been updated dramatically. The measure of the degree of update applied is the ℓ2 norm of the historical gradients, S^t = ‖∇_θ^1, ∇_θ^2, ..., ∇_θ^t‖_2, which gives the following update rule:

        θ^{t+1} = θ^t − (η / (S^t + ε)) ∇_θ^t

where ε is a small term that avoids division by zero.
AdaGrad has been shown to greatly improve the robustness of traditional gradient methods (Dean et al., 2012). However, the problem is that as the ℓ2 norm accumulates, the fraction of η over the ℓ2 norm decays to a vanishingly small term.
7.1.3 AdaDelta
AdaDelta is an extension of AdaGrad that aims to reduce the decay rate of the learning rate, proposed in (Zeiler, 2012). Instead of accumulating the gradients of every time step as in AdaGrad, AdaDelta re-weights the previous accumulation before adding the current term onto the previously accumulated result, resulting in:

        (S^t)^2 = β (S^{t−1})^2 + (1 − β)(∇_θ^t)^2

where β is the weight for re-weighting. The update rule is then the same as in AdaGrad:

        θ^{t+1} = θ^t − (η / (S^t + ε)) ∇_θ^t

which is almost the same as another famous gradient variant named RMSprop^10.
7.1.4 Adam
Adam stands for Adaptive Moment Estimation, proposed in (Kingma and Ba, 2014). Adam is like a combination of the momentum method and AdaGrad, but each component is re-weighted at time step t. Formally, at time step t, we have:

        Δ_θ^t = α Δ_θ^{t−1} + (1 − α) ∇_θ^t
        (S^t)^2 = β (S^{t−1})^2 + (1 − β)(∇_θ^t)^2
        θ^{t+1} = θ^t − (η / (S^t + ε)) Δ_θ^t
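A minimal NumPy sketch of one update step following the simplified formulation above; it omits the bias-correction terms of the published Adam algorithm, which this formulation does not include, and the hyperparameter names and defaults are illustrative.

```python
import numpy as np

def adam_like_step(theta, grad, m, s2, lr=1e-3, alpha=0.9, beta=0.999, eps=1e-8):
    """One parameter update following the simplified Adam-style rule above.

    theta : current parameters
    grad  : gradient of the loss w.r.t. theta at this step
    m     : running re-weighted average of gradients (Delta_theta)
    s2    : running re-weighted average of squared gradients ((S^t)^2)
    """
    m = alpha * m + (1.0 - alpha) * grad           # first-moment estimate
    s2 = beta * s2 + (1.0 - beta) * grad ** 2      # second-moment estimate
    theta = theta - lr * m / (np.sqrt(s2) + eps)   # re-scaled update
    return theta, m, s2
```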
All these modern gradient variants have been published with the promising claim that they improve the convergence rate of previous methods. Empirically, these methods do seem helpful; however, in many cases a careful choice among them only helps to a limited extent.
7.2 Dropout
Dropout was introduced in (Hinton et al., 2012; Srivastava et al., 2014). The technique soon became influential, not only because of its good performance but also because of its simplicity of implementation. The idea is very simple: randomly drop some of the units while training. More formally: on each training case, each hidden unit is randomly omitted from the network with a probability of p.
As suggested by Hinton et al. (2012), Dropout can be seen as an efficient way to perform model averaging across a large number of different neural networks, whereby overfitting can be avoided at much lower computational cost.
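A minimal NumPy sketch of the dropout operation during training as described above; the rescaling by 1/(1 − p) (“inverted dropout”) is a common implementation convention and an assumption here, not part of the original description.

```python
import numpy as np

def dropout(h, p, training=True, rng=None):
    """Randomly omit each hidden unit with probability p during training.

    h : activations of a hidden layer
    p : probability of dropping a unit
    """
    if not training or p == 0.0:
        return h
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(h.shape) >= p      # keep a unit with probability 1 - p
    return h * mask / (1.0 - p)          # inverted-dropout rescaling
```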
10. It seems this method was never formally published; all resources trace back to Hinton’s slides at http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf
Because of the performance it delivers, Dropout became very popular soon after its introduction, and a lot of work has attempted to understand its mechanism from different perspectives, including (Baldi and Sadowski, 2013; Cho, 2013; Ma et al., 2016). It has also been applied to train other models, such as SVMs (Chen et al., 2014).
where µB and σB denote the mean and variance of that batch, µL and σL are two parameters learned by the algorithm to rescale and shift the output, and xi and yi are the inputs and outputs of that function, respectively.
These steps are performed for every batch during training. Batch Normalization turned out to work very well in training empirically and soon became popular.
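The per-batch transform these symbols describe can be sketched as follows (a standard formulation; pairing µL and σL with the shift and scale roles follows the description above and is otherwise an assumption):

```python
import numpy as np

def batch_norm_train(x, mu_L, sigma_L, eps=1e-5):
    """Batch Normalization over one training batch.

    x       : inputs of shape (batch_size, num_features)
    mu_L    : learned shift parameter (per feature)
    sigma_L : learned scale parameter (per feature)
    """
    mu_B = x.mean(axis=0)                       # batch mean
    var_B = x.var(axis=0)                       # batch variance
    x_hat = (x - mu_B) / np.sqrt(var_B + eps)   # normalize each feature
    y = sigma_L * x_hat + mu_L                  # rescale and shift
    return y
```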
As a follow-up, Ba et al. (2016) propose Layer Normalization, which “transposes” batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. This technique therefore has the natural advantage of being directly applicable to recurrent neural networks. However, this “transposed batch normalization” cannot be implemented as simply as Batch Normalization, and it has not become as influential as Batch Normalization.
Two remarks need to be made before we proceed. 1) Obviously, most of these methods can be traced back to counterparts in the non-parametric machine learning field, but because most of them did not make enough of an impact, focusing a discussion on their evolutionary path may mislead readers; instead, we only list these methods for readers who seek inspiration. 2) Many of these methods are not exclusively optimization techniques, because they are usually proposed together with a particularly designed architecture. Technically speaking, these methods should be distributed to the previous sections according to the associated models. However, because they can barely inspire modern modeling research, but may still inspire modern optimization research, we list them in this section.
One of the earliest and most important works on this topic was proposed by Fahlman and Lebiere (1989). They introduced a model, together with its corresponding algorithm, named Cascade-Correlation Learning. The idea is that the algorithm starts with a minimal network and builds up towards a bigger network. Whenever another hidden unit is added, the parameters of the previous hidden units are fixed, and the algorithm only searches for optimal parameters for the newly added hidden unit.
Interestingly, the unique architecture of Cascade-Correlation Learning allows the network to grow deeper and wider at the same time, because every newly added hidden unit takes the data together with the outputs of previously added units as input.
Two important questions for this algorithm are 1) when to fix the parameters of the current hidden units and proceed to add and tune a newly added one, and 2) when to terminate the entire algorithm. These two questions are answered in a similar manner: the algorithm adds a new hidden unit when there are no significant changes in the existing architecture, and it terminates when the overall performance is satisfying. This training process may introduce overfitting, which might account for the fact that this method is seldom seen in modern deep learning research.
Mézard and Nadal (1989) presented the Tiling Algorithm, which learns the parameters, the number of layers, and the number of hidden units in each layer simultaneously for feedforward neural networks on Boolean functions. This algorithm was later extended to the multi-class setting by Parekh et al. (1997).
The algorithm works in such a way that, at every layer, it tries to build a layer of hidden units that clusters the data into different clusters with only one label per cluster. The algorithm keeps increasing the number of hidden units until such a clustering pattern can be achieved, and then it proceeds to add another layer.
Mézard and Nadal (1989) also offered a theoretical guarantee for the Tiling Algorithm. Basically, the theorem says that the Tiling Algorithm can greedily improve the performance of a neural network.
8. Conclusion
In this paper, we have revisited the evolutionary path of today’s deep learning models. We revisited the paths of three major families of deep learning models: the deep generative model family, the convolutional neural network family, and the recurrent neural network family, as well as some topics in optimization techniques.
This paper serves two goals. 1) First, it documents the major milestones in the history of science that have impacted the current development of deep learning; these milestones are not limited to developments in computer science. 2) More importantly, by revisiting the evolutionary path of the major milestones, this paper should suggest to the readers how these remarkable works were developed among thousands of other contemporaneous publications. Here we briefly summarize three directions that many of these milestones pursue:
• Occam’s razor: While it seems that part of the community tends to favor more complex models, layering one architecture onto another and hoping backpropagation can find the optimal parameters, history suggests that masterminds tend to think simple. Dropout is widely recognized not only because of its performance, but even more because of its simplicity of implementation and its intuitive (tentative) reasoning. From the Hopfield Network to the Restricted Boltzmann Machine, models were simplified over the iterations until the RBM was ready to be stacked.
• Be ambitious: If a model is proposed with substantially more parameters than contemporaneous ones, it must solve a problem that no others can solve nicely in order to be remarkable. LSTM is much more complex than the traditional RNN, but it bypasses the vanishing gradient problem nicely. Deep Belief Networks are famous not because they were the first to come up with the idea of stacking one RBM onto another, but because they came with an algorithm that allows deep architectures to be trained effectively.
• Read widely: Many models are inspired by domain knowledge outside the machine learning and statistics fields. The human visual cortex has greatly inspired the development of convolutional neural networks. Even the recently popular Residual Networks can find a corresponding mechanism in the human visual cortex. Generative Adversarial Networks can also find connections with game theory, which was developed over fifty years ago.
We hope these directions can help some readers to have a greater impact on the current community. Readers should also be able to summarize further directions from our revisit of these milestones.
Acknowledgements
Thanks to the demo from http://beej.us/blog/data/convolution-image-processing/ for a quick generation of examples in Figure 16. Thanks to Bojian Han at Carnegie Mellon University for the examples in Figure 21. Thanks to the blog at http://sebastianruder.com/optimizing-gradient-descent/index.html for a summary of gradient methods in Section 7.1. Thanks to Yutong Zheng and Xupeng Tong at Carnegie Mellon University for suggesting some relevant contents.
References
Emile Aarts and Jan Korst. Simulated annealing and boltzmann machines. 1988.
David H Ackley, Geoffrey E Hinton, and Terrence J Sejnowski. A learning algorithm for
boltzmann machines. Cognitive science, 9(1):147–169, 1985.
James A Anderson and Edward Rosenfeld. Talking nets: An oral history of neural networks.
MiT Press, 2000.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan. arXiv preprint
arXiv:1701.07875, 2017.
Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with
visual attention. arXiv preprint arXiv:1412.7755, 2014.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv
preprint arXiv:1607.06450, 2016.
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by
jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
Alexander Bain. Mind and Body the Theories of Their Relation by Alexander Bain. Henry
S. King & Company, 1873.
Pierre Baldi and Peter J Sadowski. Understanding dropout. In Advances in Neural Infor-
mation Processing Systems, pages 2814–2822, 2013.
Peter L Bartlett and Wolfgang Maass. Vapnik chervonenkis dimension of neural nets. The
handbook of brain theory and neural networks, pages 1188–1192, 2003.
Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In
European conference on computer vision, pages 404–417. Springer, 2006.
Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
Yoshua Bengio and Olivier Delalleau. On the expressive power of deep architectures. In
International Conference on Algorithmic Learning Theory, pages 18–36. Springer, 2011.
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with
gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994.
Yoshua Bengio, Nicolas L Roux, Pascal Vincent, Olivier Delalleau, and Patrice Marcotte.
Convex neural networks. In Advances in neural information processing systems, pages
123–130, 2005.
Yoshua Bengio, Pascal Lamblin, Dan Popovici, Hugo Larochelle, et al. Greedy layer-wise
training of deep networks. Advances in neural information processing systems, 19:153,
2007.
Monica Bianchini and Franco Scarselli. On the complexity of shallow and deep neural
network classifiers. In ESANN, 2014.
James G Booth and James P Hobert. Maximizing generalized linear mixed model likelihoods
with an automated monte carlo em algorithm. Journal of the Royal Statistical Society:
Series B (Statistical Methodology), 61(1):265–285, 1999.
Nirmal K Bose et al. Neural network fundamentals with graphs, algorithms, and applications.
Number 612.82 BOS. 1996.
Martin L Brady, Raghu Raghavan, and Joseph Slawny. Back propagation fails to separate
where perceptrons succeed. IEEE Transactions on Circuits and Systems, 36(5):665–674,
1989.
George W Brown. Iterative solution of games by fictitious play. Activity analysis of pro-
duction and allocation, 13(1):374–376, 1951.
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders.
arXiv preprint arXiv:1509.00519, 2015.
Alfredo Canziani, Adam Paszke, and Eugenio Culurciello. An analysis of deep neural
network models for practical applications. arXiv preprint arXiv:1605.07678, 2016.
Juan Luis Castro, Carlos Javier Mantas, and JM Benítez. Neural networks with a continuous squashing function in the output are universal approximators. Neural Networks, 13(6):561–563, 2000.
Ning Chen, Jun Zhu, Jianfei Chen, and Bo Zhang. Dropout training for support vector
machines. arXiv preprint arXiv:1404.4171, 2014.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel.
Infogan: Interpretable representation learning by information maximizing generative ad-
versarial nets. In Advances In Neural Information Processing Systems, pages 2172–2180,
2016.
Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using
rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078,
2014.
Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. Describing multimedia content
using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17
(11):1875–1886, 2015.
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun.
The loss surfaces of multilayer networks. In AISTATS, 2015.
Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Ben-
gio. Attention-based models for speech recognition. In Advances in Neural Information
Processing Systems, pages 577–585, 2015.
Avital Cnaan, NM Laird, and Peter Slasor. Tutorial in biostatistics: Using the general
linear mixed model to analyse unbalanced repeated measures and longitudinal data. Stat
Med, 16:2349–2380, 1997.
Alberto Colorni, Marco Dorigo, Vittorio Maniezzo, et al. Distributed optimization by ant
colonies. In Proceedings of the first European conference on artificial life, volume 142,
pages 134–142. Paris, France, 1991.
David Daniel Cox and Thomas Dean. Neural networks and neuroscience-inspired computer
vision. Current Biology, 24(18):R921–R929, 2014.
G Cybenko. Continuous valued neural networks with two hidden layers are sufficient. 1988.
Zihang Dai, Amjad Almahairi, Bachman Philip, Eduard Hovy, and Aaron Courville. Cali-
brating energy-based generative adversarial networks. ICLR submission, 2017.
Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and
Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional
non-convex optimization. In Advances in neural information processing systems, pages
2933–2941, 2014.
Bert De Brabandere, Xu Jia, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks.
In Neural Information Processing Systems (NIPS), 2016.
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Andrew
Senior, Paul Tucker, Ke Yang, Quoc V Le, et al. Large scale distributed deep networks.
In Advances in neural information processing systems, pages 1223–1231, 2012.
Olivier Delalleau and Yoshua Bengio. Shallow vs. deep sum-product networks. In Advances
in Neural Information Processing Systems, pages 666–674, 2011.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-
scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
Misha Denil, Loris Bazzani, Hugo Larochelle, and Nando de Freitas. Learning where to
attend with deep architectures for image tracking. Neural computation, 24(8):2151–2184,
2012.
Jean-Pierre Didier and Emmanuel Bigand. Rethinking physical and rehabilitation medicine:
New technologies induce new learning strategies. Springer Science & Business Media,
2011.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):
2121–2159, 2011.
Angela Lee Duckworth, Eli Tsukayama, and Henry May. Establishing causality using lon-
gitudinal hierarchical linear modeling: An illustration predicting achievement from self-
control. Social psychological and personality science, 2010.
Samuel Frederick Edwards and Phil W Anderson. Theory of spin glasses. Journal of Physics
F: Metal Physics, 5(5):965, 1975.
Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. arXiv
preprint arXiv:1512.03965, 2015.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent,
and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of
Machine Learning Research, 11(Feb):625–660, 2010.
Marcus Frean. The upstart algorithm: A method for constructing and training feedforward
neural networks. Neural computation, 2(2):198–209, 1990.
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic
style. arXiv preprint arXiv:1508.06576, 2015.
Felix A Gers and Jürgen Schmidhuber. Recurrent nets that time and count. In Neural
Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint
Conference on, volume 3, pages 189–194. IEEE, 2000.
Ross Girshick. Fast r-cnn. In Proceedings of the IEEE International Conference on Com-
puter Vision, pages 1440–1448, 2015.
Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies
for accurate object detection and semantic segmentation. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 580–587, 2014.
Ian Goodfellow. Nips 2016 tutorial: Generative adversarial networks. arXiv preprint
arXiv:1701.00160, 2016.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems, pages 2672–2680, 2014.
Marco Gori and Alberto Tesi. On the problem of local minima in backpropagation. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 14(1):76–86, 1992.
Céline Gravelines. Deep Learning via Stacked Sparse Autoencoders for Automated Voxel-
Wise Brain Parcellation Based on Functional Connectivity. PhD thesis, The University
of Western Ontario, 1991.
Alex Graves, Navdeep Jaitly, and Abdel-rahman Mohamed. Hybrid speech recognition with
deep bidirectional lstm. In Automatic Speech Recognition and Understanding (ASRU),
2013 IEEE Workshop on, pages 273–278. IEEE, 2013.
Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen Schmidhuber. LSTM: A search space odyssey. arXiv preprint arXiv:1503.04069, 2015.
Richard Gregory and Patrick Cavanagh. The blind spot. Scholarpedia, 6(10):9618, 2011.
Aman Gupta, Haohan Wang, and Madhavi Ganapathiraju. Learning structure in gene
expression data using deep architectures, with an application to gene clustering. In
Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, pages
1328–1335. IEEE, 2015.
Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for
object segmentation and fine-grained localization. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 447–456, 2015.
Johan Hastad. Almost optimal lower bounds for small depth circuits. In Proceedings of the
eighteenth annual ACM symposium on Theory of computing, pages 6–20. ACM, 1986.
James V Haxby, Elizabeth A Hoffman, and M Ida Gobbini. The distributed human neural
system for face perception. Trends in cognitive sciences, 4(6):223–233, 2000.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. arXiv preprint arXiv:1512.03385, 2015.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep
residual networks. arXiv preprint arXiv:1603.05027, 2016.
Geoffrey E Hinton, Peter Dayan, Brendan J Frey, and Radford M Neal. The “wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214):1158, 1995.
Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep
belief nets. Neural computation, 18(7):1527–1554, 2006.
Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R
Salakhutdinov. Improving neural networks by preventing co-adaptation of feature de-
tectors. arXiv preprint arXiv:1207.0580, 2012.
Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
John J Hopfield. Neural networks and physical systems with emergent collective computa-
tional abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982.
Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks
are universal approximators. Neural networks, 2(5):359–366, 1989.
Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard Hovy, and Eric Xing. Harnessing deep
neural networks with logic rules. arXiv preprint arXiv:1603.06318, 2016.
David H Hubel and Torsten N Wiesel. Receptive fields of single neurones in the cat’s striate
cortex. The Journal of physiology, 148(3):574–591, 1959.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Nal Kalchbrenner, Aaron van den Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals,
Alex Graves, and Koray Kavukcuoglu. Video pixel networks. arXiv preprint
arXiv:1610.00527, 2016.
Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv
preprint arXiv:1412.6980, 2014.
Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.
Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-
supervised learning with deep generative models. In Advances in Neural Information
Processing Systems, pages 3581–3589, 2014.
Tinne Hoff Kjeldsen. John von neumann’s conception of the minimax theorem: a journey
through different mathematical contexts. Archive for history of exact sciences, 56(1):
39–68, 2001.
Teuvo Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, 1990.
Bryan Kolb, Ian Q Whishaw, and G Campbell Teskey. An introduction to brain and behav-
ior, volume 1273. 2014.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz,
Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome:
Connecting language and vision using crowdsourced dense image annotations. arXiv
preprint arXiv:1602.07332, 2016.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images.
2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems,
pages 1097–1105, 2012.
Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convo-
lutional inverse graphics network. In Advances in Neural Information Processing Systems,
pages 2539–2547, 2015.
Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept
learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
Alan S Lapedes and Robert M Farber. How neural nets work. In Neural information
processing systems, pages 442–456, 1988.
Hugo Larochelle and Geoffrey E Hinton. Learning to combine foveal glimpses with a third-
order boltzmann machine. In Advances in neural information processing systems, pages
1243–1251, 2010.
B Boser Le Cun, John S Denker, D Henderson, Richard E Howard, W Hubbard, and
Lawrence D Jackel. Handwritten digit recognition with a back-propagation network. In
Advances in neural information processing systems. Citeseer, 1990.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning
applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998a.
Yann LeCun, Corinna Cortes, and Christopher JC Burges. The mnist database of hand-
written digits, 1998b.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):
436–444, 2015.
Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Y Ng. Convolutional deep
belief networks for scalable unsupervised learning of hierarchical representations. In Pro-
ceedings of the 26th annual international conference on machine learning, pages 609–616.
ACM, 2009.
Zhaoping Li. A neural model of contour integration in the primary visual cortex. Neural
computation, 10(4):903–940, 1998.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan,
Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In
European Conference on Computer Vision, pages 740–755. Springer, 2014.
Cheng-Yuan Liou and Shiao-Lin Lin. Finite memory loading in hairy neurons. Natural
Computing, 5(1):15–42, 2006.
Cheng-Yuan Liou and Shao-Kuo Yuan. Error tolerant associative memory. Biological Cy-
bernetics, 81(4):331–342, 1999.
Christoph Lippert, Jennifer Listgarten, Ying Liu, Carl M Kadie, Robert I Davidson, and
David Heckerman. Fast linear mixed models for genome-wide association studies. Nature
methods, 8(10):833–835, 2011.
Zachary C Lipton, John Berkowitz, and Charles Elkan. A critical review of recurrent neural
networks for sequence learning. arXiv preprint arXiv:1506.00019, 2015.
David G Lowe. Object recognition from local scale-invariant features. In Computer vision,
1999. The proceedings of the seventh IEEE international conference on, volume 2, pages
1150–1157. Ieee, 1999.
Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf.
arXiv preprint arXiv:1603.01354, 2016.
Xuezhe Ma, Yingkai Gao, Zhiting Hu, Yaoliang Yu, Yuntian Deng, and Eduard Hovy.
Dropout with expectation-linear regularization. arXiv preprint arXiv:1609.08017, 2016.
M Maschler, Eilon Solan, and Shmuel Zamir. Game theory. Translated from the Hebrew by Ziv Hellman and edited by Mike Borns, 2013.
Charles E McCulloch and John M Neuhaus. Generalized linear mixed models. Wiley Online
Library, 2001.
Warren S McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous
activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943.
Marc Mézard and Jean-P Nadal. Learning in feedforward layered networks: The tiling
algorithm. Journal of Physics A: Mathematical and General, 22(12):2191, 1989.
Marc Mézard, Giorgio Parisi, and Miguel-Angel Virasoro. Spin glass theory and beyond.
1990.
Marvin L Minski and Seymour A Papert. Perceptrons: an introduction to computational
geometry. MA: MIT Press, Cambridge, 1969.
Melanie Mitchell. An introduction to genetic algorithms. MIT press, 1998.
Tom M Mitchell et al. Machine learning. wcb, 1997.
Jeffrey Moran and Robert Desimone. Selective attention gates visual processing in the
extrastriate cortex. Frontiers in cognitive neuroscience, 229:342–345, 1985.
Michael C Mozer. A focused back-propagation algorithm for temporal pattern recognition.
Complex systems, 3(4):349–381, 1989.
Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann
machines. In Proceedings of the 27th International Conference on Machine Learning
(ICML-10), pages 807–814, 2010.
John Nash. Non-cooperative games. Annals of mathematics, pages 286–295, 1951.
John F Nash et al. Equilibrium points in n-person games. Proc. Nat. Acad. Sci. USA, 36
(1):48–49, 1950.
Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High
confidence predictions for unrecognizable images. In 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 427–436. IEEE, 2015.
Danh V Nguyen, Damla Şentürk, and Raymond J Carroll. Covariate-adjusted linear mixed
effects model with an application to longitudinal data. Journal of nonparametric statis-
tics, 20(6):459–481, 2008.
Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathe-
matical biology, 15(3):267–273, 1982.
Erkki Oja and Juha Karhunen. On stochastic approximation of the eigenvectors and eigen-
values of the expectation of a random matrix. Journal of mathematical analysis and
applications, 106(1):69–84, 1985.
Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural
networks. arXiv preprint arXiv:1601.06759, 2016.
Keiichi Osako, Rita Singh, and Bhiksha Raj. Complex recurrent neural networks for denois-
ing speech signals. In Applications of Signal Processing to Audio and Acoustics (WAS-
PAA), 2015 IEEE Workshop on, pages 1–5. IEEE, 2015.
Rajesh G Parekh, Jihoon Yang, and Vasant Honavar. Constructive neural network learning
algorithms for multi-category real-valued pattern classification. 1997.
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct
deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013a.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent
neural networks. ICML (3), 28:1310–1318, 2013b.
Razvan Pascanu, Yann N Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point
problem for non-convex optimization. arXiv preprint arXiv:1405.4604, 2014.
Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier,
and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences
for richer image-to-sentence models. In Proceedings of the IEEE International Conference
on Computer Vision, pages 2641–2649, 2015.
Tomaso Poggio and Thomas Serre. Models of visual cortex. Scholarpedia, 8(4):3516, 2013.
Christopher Poultney, Sumit Chopra, Yann L Cun, et al. Efficient learning of sparse rep-
resentations with an energy-based model. In Advances in neural information processing
systems, pages 1137–1144, 2006.
Jose C Principe, Neil R Euliano, and W Curt Lefebvre. Neural and adaptive systems:
fundamentals through simulations with CD-ROM. John Wiley & Sons, Inc., 1999.
Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-
time object detection with region proposal networks. In Advances in neural information
processing systems, pages 91–99, 2015.
Martin Riedmiller and Heinrich Braun. A direct adaptive method for faster backpropa-
gation learning: The rprop algorithm. In Neural Networks, 1993., IEEE International
Conference On, pages 586–591. IEEE, 1993.
AJ Robinson and Frank Fallside. The utility driven dynamic error propagation network.
University of Cambridge Department of Engineering, 1987.
Frank Rosenblatt. The perceptron: a probabilistic model for information storage and orga-
nization in the brain. Psychological review, 65(6):386, 1958.
David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal repre-
sentations by error propagation. Technical report, DTIC Document, 1985.
Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhi-
heng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large
scale visual recognition challenge. International Journal of Computer Vision, 115(3):
211–252, 2015.
S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology.
1990.
Ruslan Salakhutdinov and Geoffrey E Hinton. Deep boltzmann machines. In AISTATS,
volume 1, page 3, 2009.
Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and
Xi Chen. Improved techniques for training gans. In Advances in Neural Information
Processing Systems, pages 2226–2234, 2016.
Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks,
61:85–117, 2015.
Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE Trans-
actions on Signal Processing, 45(11):2673–2681, 1997.
Thomas Serre, Lior Wolf, and Tomaso Poggio. Object recognition with features inspired
by visual cortex. In 2005 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR’05), volume 2, pages 994–1000. IEEE, 2005.
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hin-
ton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-
experts layer. arXiv preprint arXiv:1701.06538, 2017.
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale
image recognition. arXiv preprint arXiv:1409.1556, 2014.
Paul Smolensky. Information processing in dynamical systems: Foundations of harmony
theory. Technical report, DTIC Document, 1986.
Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation
using deep conditional generative models. In Advances in Neural Information Processing
Systems, pages 3483–3491, 2015.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut-
dinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of
Machine Learning Research, 15(1):1929–1958, 2014.
Amos Storkey. Increasing the capacity of a hopfield network without sacrificing functionality.
In International Conference on Artificial Neural Networks, pages 451–456. Springer, 1997.
Richard S Sutton, David A McAllester, Satinder P Singh, Yishay Mansour, et al. Pol-
icy gradient methods for reinforcement learning with function approximation. In NIPS,
volume 99, pages 1057–1063, 1999.
Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint
arXiv:1312.6199, 2013.
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir
Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper
with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pages 1–9, 2015.
Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al.
Conditional image generation with pixelcnn decoders. In Advances In Neural Information
Processing Systems, pages 4790–4798, 2016.
Andreas Veit, Michael J Wilber, and Serge Belongie. Residual networks behave like en-
sembles of relatively shallow networks. In Advances in Neural Information Processing
Systems, pages 550–558, 2016.
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Man-
zagol. Stacked denoising autoencoders: Learning useful representations in a deep network
with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–
3408, 2010.
Haohan Wang and Jingkang Yang. Multiple confounders correction with regularized linear
mixed effect models, with application in biological processes. 2016.
Haohan Wang, Aaksha Meghawat, Louis-Philippe Morency, and Eric P Xing. Select-additive
learning: Improving cross-individual generalization in multimodal sentiment analysis.
arXiv preprint arXiv:1609.05244, 2016.
Paul J Werbos. Backpropagation through time: what it does and how to do it. Proceedings
of the IEEE, 78(10):1550–1560, 1990.
Bernard Widrow et al. Adaptive “Adaline” neuron using chemical “memistors”. 1960.
Alan L Wilkes and Nicholas J Wade. Bain on neural networks. Brain and cognition, 33(3):
295–305, 1997.
Andrew G Wilson, Christoph Dann, Chris Lucas, and Eric P Xing. The human kernel. In
Advances in Neural Information Processing Systems, pages 2854–2862, 2015.
SHI Xingjian, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-
chun Woo. Convolutional lstm network: A machine learning approach for precipitation
nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdi-
nov, Richard S Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption
generation with visual attention. arXiv preprint arXiv:1502.03044, 2(3):5, 2015.
Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual sequence
tagging from scratch. arXiv preprint arXiv:1603.06270, 2016.
Andrew Chi-Chih Yao. Separating the polynomial-time hierarchy by oracles. In 26th Annual
Symposium on Foundations of Computer Science (sfcs 1985), 1985.
Xin Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447,
1999.
Dong Yu, Li Deng, and George Dahl. Roles of pre-training and fine-tuning in context-
dependent dbn-hmms for real-world speech recognition. In Proc. NIPS Workshop on
Deep Learning and Unsupervised Feature Learning, 2010.
Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint
arXiv:1605.07146, 2016.
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.
Understanding deep learning requires rethinking generalization. arXiv preprint
arXiv:1611.03530, 2016a.
Ke Zhang, Miao Sun, Tony X Han, Xingfang Yuan, Liru Guo, and Tao Liu. Resid-
ual networks of residual networks: Multilevel residual networks. arXiv preprint
arXiv:1608.02908, 2016b.
Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial net-
work. arXiv preprint arXiv:1609.03126, 2016.
Xiang Zhou and Matthew Stephens. Genome-wide efficient mixed-model analysis for asso-
ciation studies. Nature genetics, 44(7):821–824, 2012.
Abstract—The seminal work of Gatys et al. demonstrated the power of Convolutional Neural Networks (CNNs) in creating artistic
imagery by separating and recombining image content and style. This process of using CNNs to render a content image in different
styles is referred to as Neural Style Transfer (NST). Since then, NST has become a trending topic both in academic literature and
industrial applications. It is receiving increasing attention and a variety of approaches are proposed to either improve or extend the
original NST algorithm. In this paper, we aim to provide a comprehensive overview of the current progress towards NST. We first
propose a taxonomy of current algorithms in the field of NST. Then, we present several evaluation methods and compare different NST
algorithms both qualitatively and quantitatively. The review concludes with a discussion of various applications of NST and open problems for future research. A list of papers discussed in this review, corresponding codes, pre-trained models and more comparison results are publicly available at: https://github.com/ycjing/Neural-Style-Transfer-Papers.
1 INTRODUCTION
“Dwelling in the Fuchun Mountains” onto a photo of The Great Wall. Since the algorithm of Gatys et al. does not have any explicit restrictions on the type of style images and also does not need ground truth results for training, it breaks the constraints of previous approaches. The work of Gatys et al. opened up a new field called Neural Style Transfer (NST), which is the process of using Convolutional Neural Network to render a content image in different styles.

The seminal work of Gatys et al. has attracted wide attention from both academia and industry. In academia, lots of follow-up studies were conducted to either improve or extend this NST algorithm. The related researches of NST have also led to many successful industrial applications (e.g., Prisma [11], Ostagram [12], Deep Forger [13]). However, there is no comprehensive survey summarising and discussing recent advances as well as challenges within this new field of Neural Style Transfer.

In this paper, we aim to provide an overview of current advances (up to March 2018) in Neural Style Transfer (NST). Our contributions are threefold. First, we investigate, classify and summarise recent advances in the field of NST. Second, we present several evaluation methods and experimentally compare different NST algorithms. Third, we summarise current challenges in this field and propose possible directions on how to deal with them in future works.

The organisation of this paper is as follows. We start our discussion with a brief review of previous artistic rendering methods without CNNs in Section 2. Then Section 3 explores the derivations and foundations of NST. Based on the discussions in Section 3, we categorise and explain existing NST algorithms in Section 4. Some improvement strategies for these methods and their extensions will be given in Section 5. Section 6 presents several methodologies for evaluating NST algorithms and aims to build a standardised benchmark for follow-up studies. Then we demonstrate the commercial applications of NST in Section 7, including both current successful usages and its potential applications. In Section 8, we summarise current challenges in the field of NST, as well as propose possible directions on how to deal with them in future works. Finally, Section 9 concludes the paper and delineates several promising directions for future research.

2 STYLE TRANSFER WITHOUT NEURAL NETWORKS

Artistic stylisation is a long-standing research topic. Due to its wide variety of applications, it has been an important research area for more than two decades. Before the appearance of NST, the related researches have expanded into an area called non-photorealistic rendering (NPR). In this section, we briefly review some of these artistic rendering (AR) algorithms without CNNs. Specifically, we focus on artistic stylization of 2D images, which is called image-based artistic rendering (IB-AR) in [14]. For a more comprehensive overview of IB-AR techniques, we recommend [3], [14], [15]. Following the IB-AR taxonomy defined by Kyprianidis et al. [14], we first introduce each category of IB-AR techniques without CNNs and then discuss their strengths and weaknesses.

Stroke-Based Rendering. Stroke-based rendering (SBR) refers to a process of placing virtual strokes (e.g., brush strokes, tiles, stipples) upon a digital canvas to render a photograph with a particular style [16]. The process of SBR is generally starting from a source photo, incrementally compositing strokes to match the photo, and finally producing a non-photorealistic imagery, which looks like the photo but with an artistic style. During this process, an objective function is designed to guide the greedy or iterative placement of strokes.

The goal of SBR algorithms is to faithfully depict a prescribed style. Therefore, they are generally effective at simulating certain types of styles (e.g., oil paintings, watercolours, sketches). However, each SBR algorithm is carefully designed for only one particular style and not capable of simulating an arbitrary style, which is not flexible.

Region-Based Techniques. Region-based rendering is to incorporate region segmentation to enable the adaption of rendering based on the content in regions. Early region-based IB-AR algorithms exploit the shape of regions to guide the stroke placement [17], [18]. In this way, different stroke patterns can be produced in different semantic regions in an image. Song et al. [19] further propose a region-based IB-AR algorithm to manipulate geometry for artistic styles. Their algorithm creates simplified shape rendering effects by replacing regions with several canonical shapes. Considering regions in rendering allows the local control over the level of details. However, the problem in SBR persists: one region-based rendering algorithm is not capable of simulating an arbitrary style.

Example-Based Rendering. The goal of example-based rendering is to learn the mapping between an exemplar pair. This category of IB-AR techniques is pioneered by Hertzmann et al., who propose a framework named image analogies [9]. Image analogies aim to learn a mapping between a pair of source images and target stylised images in a supervised manner. The training set of image analogy comprises pairs of unstylised source images and the corresponding stylised images with a particular style. Image analogy algorithm then learns the analogous transformation from the example training pairs and creates analogous stylised results when given a test input photograph. Image analogy can also be extended in various ways, e.g., to learn stroke placements for portrait painting rendering [20].

In general, image analogies are effective for a variety of artistic styles. However, pairs of training data are usually unavailable in practice. Another limitation is that image analogies only exploit low-level image features. Therefore, image analogies typically fail to effectively capture content and style, which limits the performance.

Image Processing and Filtering. Creating an artistic image is a process that aims for image simplification and abstraction. Therefore, it is natural to consider adopting and combining some related image processing filters to render a given photo. For example, in [21], Winnemöller et al. for the first time exploit bilateral [22] and difference of Gaussians filters [23] to automatically produce cartoon-like effects. Compared with other categories of IB-AR techniques, image-filtering based rendering algorithms are generally straightforward to implement and efficient in practice. At an expense, they are very limited in style diversity.
Figure 2: A taxonomy of NST techniques (organising methods under Image-Optimisation-Based Online Neural Methods and Model-Optimisation-Based Offline Neural Methods). Our proposed NST taxonomy extends the IB-AR taxonomy proposed by Kyprianidis et al. [14].
Summary. Based on the above discussions, although some IB-AR algorithms without CNNs are capable of faithfully depicting certain prescribed styles, they typically have the limitations in flexibility, style diversity, and effective image structure extractions. Therefore, there is a demand for novel algorithms to address these limitations, which gives birth to the field of NST.

3 DERIVATIONS OF NEURAL STYLE TRANSFER

For a better understanding of the NST development, we start by introducing its derivations. To automatically transfer an artistic style, the first and most important issue is how to model and extract style from an image. Since style is very related to texture¹, a straightforward way is to relate Visual Style Modelling back to previously well-studied Visual Texture Modelling methods. After obtaining the style representation, the next issue is how to reconstruct an image with desired style information while preserving its content, which is addressed by the Image Reconstruction techniques.

1. We clarify that style is very related to texture but not limited to texture. Style also involves a large degree of simplification and shape abstraction effects, which falls back to the composition or alignment of texture features.

3.1 Visual Texture Modelling

Visual texture modelling [24] is previously studied as the heart of texture synthesis [25], [26]. Throughout the history, there are two distinct approaches to model visual textures, which are Parametric Texture Modelling with Summary Statistics and Non-parametric Texture Modelling with Markov Random Fields (MRFs).

1) Parametric Texture Modelling with Summary Statistics. One path towards texture modelling is to capture image statistics from a sample texture and exploit summary statistical property to model the texture. The idea is first proposed by Julesz [27], who models textures as pixel-based N-th order statistics. Later, the work in [28] exploits filter responses to analyze textures, instead of direct pixel-based measurements. After that, Portilla and Simoncelli [29] further introduce a texture model based on multi-scale orientated filter responses and use gradient descent to improve synthesised results. A more recent parametric texture modelling approach proposed by Gatys et al. [30] is the first to measure summary statistics in the domain of a CNN. They design a Gram-based representation to model textures, which is the correlations between filter responses in different layers of a pre-trained classification network (VGG network) [31]. More specifically, the Gram-based representation encodes the second order statistics of the set of CNN filter responses. Next, we will explain this representation in detail for the usage of the following sections.

Assume that the feature map of a sample texture image I_s at layer l of a pre-trained deep classification network is F^l(I_s) ∈ R^{C×H×W}, where C is the number of channels, and H and W represent the height and width of the feature map F(I_s). Then the Gram-based representation can be obtained by computing the Gram matrix G(F^l(I_s)′) ∈ R^{C×C} over the feature map F^l(I_s)′ ∈ R^{C×(HW)} (a reshaped version of F^l(I_s)):

        G(F^l(I_s)′) = [F^l(I_s)′][F^l(I_s)′]^T.        (1)

This Gram-based texture representation from a CNN is effective at modelling wide varieties of both natural and non-natural textures. However, the Gram-based representation is designed to capture global statistics and tosses spatial arrangements, which leads to unsatisfying results for modelling regular textures with long-range symmetric structures. To address this problem, Berger and Memisevic [32] propose to horizontally and vertically translate feature
4
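To make the representation concrete, the following is a minimal sketch (ours, not code from the cited papers) of how the Gram matrix of Equation (1) can be computed from a single feature map; the feature extraction itself is assumed to be done elsewhere, e.g., with a pre-trained VGG network.

```python
import torch

def gram_matrix(feature_map: torch.Tensor) -> torch.Tensor:
    """Gram-based texture representation of Equation (1).

    feature_map: tensor of shape (C, H, W) holding the responses F^l(I_s)
    of one layer of a pre-trained classification network.
    Returns the C x C Gram matrix G(F^l(I_s)').
    """
    c, h, w = feature_map.shape
    f = feature_map.reshape(c, h * w)   # F^l(I_s)': reshape to C x (HW)
    return f @ f.t()                    # [F^l(I_s)'] [F^l(I_s)']^T

# Example with a random stand-in for a real VGG feature map:
# g = gram_matrix(torch.randn(64, 32, 32))   # -> 64 x 64 matrix
```

Many implementations additionally divide the result by C·H·W so that the statistic is comparable across layers and image sizes; Equation (1) itself is unnormalised.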
2) Non-parametric Texture Modelling with MRFs. Another notable texture modelling methodology is to use non-parametric resampling. A variety of non-parametric methods are based on the MRF model, which assumes that in a texture image, each pixel is entirely characterised by its spatial neighbourhood. Under this assumption, Efros and Leung [25] propose to synthesise each pixel one by one by searching for similar neighbourhoods in the source texture image and assigning the corresponding pixel. Their work is one of the earliest non-parametric algorithms with MRFs. Following their work, Wei and Levoy [26] further speed up the neighbourhood matching process by always using a fixed neighbourhood.

3.2 Image Reconstruction

In general, an essential step for many vision tasks is to extract an abstract representation from the input image. Image reconstruction is the reverse process, which is to reconstruct the whole input image from the extracted image representation. It has previously been studied to analyse a particular image representation and discover what information is contained in the abstract representation. Here our major focus is on CNN representation based image reconstruction algorithms, which can be categorised into Image-Optimisation-Based Online Image Reconstruction (IOB-IR) and Model-Optimisation-Based Offline Image Reconstruction (MOB-IR).

1) Image-Optimisation-Based Online Image Reconstruction. The first algorithm to reverse CNN representations is proposed by Mahendran and Vedaldi [33], [34]. Given a CNN representation to be reversed, their algorithm iteratively optimises an image (generally starting from random noise) until it has a similar desired CNN representation. The iterative optimisation process is based on gradient descent in image space. Therefore, the process is time-consuming, especially when the desired reconstructed image is large.

2) Model-Optimisation-Based Offline Image Reconstruction. To address the efficiency issue of [33], [34], Dosovitskiy and Brox [35] propose to train a feed-forward network in advance and put the computational burden at the training stage. At the testing stage, the reverse process can simply be done with a network forward pass. Their algorithm significantly speeds up the image reconstruction process. In their later work [36], they further combine a Generative Adversarial Network (GAN) [37] to improve the results.

4 A TAXONOMY OF NEURAL STYLE TRANSFER ALGORITHMS

NST is a subset of the aforementioned example-based IB-AR techniques. In this section, we first provide a categorisation of NST algorithms and then explain the major 2D image based non-photorealistic NST algorithms (Figure 2, purple boxes) in detail. More specifically, for each algorithm, we start by introducing the main idea and then discuss its weaknesses and strengths. Since it is complex to define the notion of style [3], [38], and therefore very subjective to define what criteria are important for a successful style transfer algorithm [39], here we try to evaluate these algorithms in a more structured way by focusing only on details, semantics, depth and variations in brush strokes². We will discuss the problem of aesthetic evaluation criteria further in Section 8 and also present more evaluation results in Section 6.

2. We claim that the visual criteria with respect to a successful style transfer are definitely not limited to these factors.

Our proposed taxonomy of NST techniques is shown in Figure 2. We keep the taxonomy of IB-AR techniques proposed by Kyprianidis et al. [14] unaffected and extend it with NST algorithms. Current NST methods fit into one of two categories, Image-Optimisation-Based Online Neural Methods (IOB-NST) and Model-Optimisation-Based Offline Neural Methods (MOB-NST). The first category transfers the style by iteratively optimising an image, i.e., algorithms belonging to this category are built upon IOB-IR techniques. The second category optimises a generative model offline and produces the stylised image with a single forward pass, which exploits the idea of MOB-IR techniques.

4.1 Image-Optimisation-Based Online Neural Methods

DeepDream [40] is the first attempt to produce artistic images by reversing CNN representations with IOB-IR techniques. By further combining Visual Texture Modelling techniques to model style, IOB-NST algorithms are subsequently proposed, which build the early foundations for the field of NST. Their basic idea is to first model and extract style and content information from the corresponding style and content images, recombine them as the target representation, and then iteratively reconstruct a stylised result that matches the target representation. In general, different IOB-NST algorithms share the same IOB-IR technique, but differ in the way they model the visual style, which is built on the aforementioned two categories of Visual Texture Modelling techniques. The common limitation of IOB-NST algorithms is that they are computationally expensive, due to the iterative image optimisation procedure.
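As a rough illustration of this shared IOB-IR backbone (our sketch, not code from any of the cited papers), the iterative procedure can be summarised as follows; the feature extractor and the two loss terms are placeholders for whichever texture model and loss network a particular IOB-NST algorithm adopts.

```python
import torch

def iob_nst(target_content, target_style, extract_features, content_loss, style_loss,
            image_shape=(1, 3, 256, 256), steps=500, lr=0.05, alpha=1.0, beta=1e3):
    """Generic image-optimisation-based NST loop: start from noise and iteratively
    update the image so that its features match the target representation."""
    image = torch.randn(image_shape, requires_grad=True)   # generally initialised with random noise
    optimiser = torch.optim.Adam([image], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        feats = extract_features(image)                    # e.g. responses of a pre-trained VGG
        loss = alpha * content_loss(feats, target_content) + beta * style_loss(feats, target_style)
        loss.backward()                                    # gradient descent in image space
        optimiser.step()
    return image.detach()
```

Implementations of this family of methods often use L-BFGS rather than Adam for the image optimisation; the overall structure is the same.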
4.1.1 Parametric Neural Methods with Summary Statistics

The first subset of IOB-NST methods is based on Parametric Texture Modelling with Summary Statistics. The style is characterised as a set of spatial summary statistics.

We start by introducing the first NST algorithm, proposed by Gatys et al. [4], [10]. By reconstructing representations from intermediate layers of the VGG-19 network, Gatys et al. observe that a deep convolutional neural network is capable of extracting image content from an arbitrary photograph and some appearance information from a well-known artwork. According to this observation, they build the content component of the newly stylised image by penalising the difference of high-level representations derived from the content and stylised images, and further build the style component by matching Gram-based summary statistics of the style and stylised images, which is derived from their proposed texture modelling technique [30] (Section 3.1). The details of their algorithm are as follows.

Given a content image I_c and a style image I_s, the algorithm in [4] seeks a stylised image I that minimises the following objective:

  I* = arg min_I L_total(I_c, I_s, I)
     = arg min_I α L_c(I_c, I) + β L_s(I_s, I),              (2)

where L_c compares the content representation of the given content image to that of the stylised image, and L_s compares the Gram-based style representation derived from the style image to that of the stylised image. α and β are used to balance the content component and the style component in the stylised result.

The content loss L_c is defined by the squared Euclidean distance between the feature representations F^l of the content image I_c in layer l and those of the stylised image I, which is initialised with a noise image:

  L_c = Σ_{l ∈ {l_c}} || F^l(I_c) − F^l(I) ||^2 ,              (3)

where {l_c} denotes the set of VGG layers used for computing the content loss. For the style loss L_s, [4] exploits the Gram-based visual texture modelling technique to model the style, which has already been explained in Section 3.1. Therefore, the style loss is defined by the squared Euclidean distance between the Gram-based style representations of I_s and I:

  L_s = Σ_{l ∈ {l_s}} || G(F^l(I_s)') − G(F^l(I)') ||^2 ,       (4)

where G is the aforementioned Gram matrix, which encodes the second order statistics of the set of filter responses, and {l_s} represents the set of VGG layers used for calculating the style loss.

The choice of content and style layers is an important factor in the process of style transfer. Different positions and numbers of layers can result in very different visual experiences. Given the pre-trained VGG-19 [31] as the loss network, Gatys et al.'s choice of {l_s} and {l_c} in [4] is {l_s} = {relu1_1, relu2_1, relu3_1, relu4_1, relu5_1} and {l_c} = {relu4_2}. For {l_s}, the idea of combining multiple layers (up to higher layers) is critical for the success of Gatys et al.'s NST algorithm. Matching the multi-scale style representations leads to a smoother and more continuous stylisation, which gives the visually most appealing results [4]. For the content layer {l_c}, matching the content representations on a lower layer preserves the undesired fine structures (e.g., edges and colour map) of the original content image during stylisation. In contrast, by matching the content on a higher layer of the network, the fine structures can be altered to agree with the desired style while preserving the content information of the content image. Also, using VGG-based loss networks for style transfer is not the only option. Similar performance can be achieved by selecting other pre-trained classification networks, e.g., ResNet [41].

In Equation (2), both L_c and L_s are differentiable. Thus, with random noise as the initial I, Equation (2) can be minimised by using gradient descent in image space with backpropagation. In addition, a total variation denoising term is usually added in practice to encourage smoothness in the stylised result.
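The following sketch (ours, written under the simplifying assumption that the layer-wise VGG activations are already available as lists of tensors) shows how the terms of Equations (2)-(4) can be assembled; weighting and normalisation conventions vary across implementations.

```python
import torch

def gram(f: torch.Tensor) -> torch.Tensor:
    """Gram matrix of a (C, H, W) feature map, scaled by its size (a common convention)."""
    c, h, w = f.shape
    f = f.reshape(c, h * w)
    return (f @ f.t()) / (c * h * w)

def content_loss(feats, target_feats):
    """Equation (3): squared distance between features at the content layers {l_c}."""
    return sum(((a - b) ** 2).sum() for a, b in zip(feats, target_feats))

def style_loss(feats, style_feats):
    """Equation (4): squared distance between Gram matrices at the style layers {l_s}."""
    return sum(((gram(a) - gram(b)) ** 2).sum() for a, b in zip(feats, style_feats))

def total_loss(content_layer_feats, target_content_feats,
               style_layer_feats, target_style_feats, alpha=1.0, beta=1e3):
    """Equation (2); a total variation term is often added on top of this in practice."""
    return (alpha * content_loss(content_layer_feats, target_content_feats)
            + beta * style_loss(style_layer_feats, target_style_feats))
```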
The algorithm of Gatys et al. does not need ground truth data for training and also does not place explicit restrictions on the type of style images, which addresses the limitations of previous IB-AR algorithms without CNNs (Section 2). However, the algorithm of Gatys et al. does not perform well at preserving the coherence of fine structures and details during stylisation, since CNN features inevitably lose some low-level information. Also, it generally fails for photorealistic synthesis, due to the limitations of the Gram-based style representation. Moreover, it does not consider the variations of brush strokes or the semantics and depth information contained in the content image, which are important factors in evaluating visual quality.

In addition, a Gram-based style representation is not the only choice to statistically encode style information. There are also some other effective statistical style representations, which are derived from the Gram-based representation. Li et al. [42] derive several different style representations by considering style transfer in the domain of transfer learning, or more specifically, domain adaption [43]. Given that training and testing data are drawn from different distributions, the goal of domain adaption is to adapt a model trained on labelled training data from a source domain to predict labels of unlabelled testing data from a target domain. One way to perform domain adaption is to match a sample in the source domain to one in the target domain by minimising their distribution discrepancy, for which Maximum Mean Discrepancy (MMD) is a popular choice of discrepancy measure between two distributions. Li et al. prove that matching Gram-based style representations between a pair of style and stylised images is intrinsically minimising MMD with a quadratic polynomial kernel. Therefore, it is expected that other kernel functions for MMD can be equally applied in NST, e.g., the linear kernel, polynomial kernel and Gaussian kernel. Another related representation is the batch normalisation (BN) statistic representation, which uses the mean and variance of the feature maps in VGG layers to model style:

  L_s = Σ_{l ∈ {l_s}} (1 / C^l) Σ_{c=1}^{C^l} ( || µ(F_c^l(I_s)) − µ(F_c^l(I)) ||^2 + || σ(F_c^l(I_s)) − σ(F_c^l(I)) ||^2 ),   (5)

where F_c^l ∈ R^{H×W} is the c-th feature map channel at layer l of the VGG network, and C^l is the number of channels.

The main contribution of Li et al.'s algorithm is to theoretically demonstrate that the Gram matrix matching process in NST is equivalent to minimising MMD with a second order polynomial kernel, thus proposing a timely interpretation of NST and making the principle of NST clearer. However, the algorithm of Li et al. does not resolve the aforementioned limitations of Gatys et al.'s algorithm.

One limitation of the Gram-based algorithm is its instability during optimisation. Also, it requires manually tuning the parameters, which is very tedious. Risser et al. [44] find that feature activations with quite different means and variances can still have the same Gram matrix, which is the main reason for the instabilities. Inspired by this observation, Risser et al. introduce an extra histogram loss, which guides the optimisation to match the entire histogram of feature activations. They also present a preliminary solution to automatic parameter tuning, which is to explicitly prevent gradients with extreme values through extreme gradient normalisation.

By additionally matching the histogram of feature activations, the algorithm of Risser et al. achieves a more stable style transfer with fewer iterations and less parameter tuning effort. However, this benefit comes at the expense of a high computational complexity. Also, the aforementioned weaknesses of Gatys et al.'s algorithm still exist, e.g., a lack of consideration of depth and the coherence of details.

All these aforementioned neural methods only compare content and stylised images in the CNN feature space to make the stylised image semantically similar to the content image. But since CNN features inevitably lose some low-level information contained in the image, there are usually some unappealing distorted structures and irregular artefacts in the stylised results. To preserve the coherence of fine structures during stylisation, Li et al. [45] propose to incorporate additional constraints upon low-level features in pixel space. They introduce an additional Laplacian loss, which is defined as the squared Euclidean distance between the Laplacian filter responses of the content image and the stylised result. The Laplacian filter computes the second order derivatives of the pixels in an image and is widely used for edge detection.

The algorithm of Li et al. performs well in preserving the fine structures and details during stylisation. But it still lacks consideration of semantics, depth, variations in brush strokes, etc.

4.1.2 Non-parametric Neural Methods with MRFs

Non-parametric IOB-NST is built on the basis of Non-parametric Texture Modelling with MRFs. This category considers NST at a local level, i.e., it operates on patches to match the style.

Li and Wand [46] are the first to propose an MRF-based NST algorithm. They find that the parametric NST method with summary statistics only captures the per-pixel feature correlations and does not constrain the spatial layout, which leads to a less visually plausible result for photorealistic styles. Their solution is to model the style in a non-parametric way and introduce a new style loss function which includes a patch-based MRF prior:

  L_s = Σ_{l ∈ {l_s}} Σ_{i=1}^{m} || Ψ_i(F^l(I)) − Ψ_{NN(i)}(F^l(I_s)) ||^2 ,   (6)

where Ψ(F^l(I)) is the set of all local patches from the feature map F^l(I), Ψ_i denotes the i-th local patch, and Ψ_{NN(i)} is the style patch most similar to the i-th local patch of the stylised image I. The best matching Ψ_{NN(i)} is obtained by calculating normalised cross-correlation over all style patches in the style image I_s. m is the total number of local patches. Since their algorithm matches styles at the patch level, the fine structure and arrangement can be preserved much better.
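A minimal sketch of the patch matching behind Equation (6) (ours; it brute-forces the normalised cross-correlation by convolving the content feature map with L2-normalised style patches, and assumes feature maps of shape (C, H, W)):

```python
import torch
import torch.nn.functional as F

def extract_patches(feat: torch.Tensor, size: int = 3, stride: int = 1) -> torch.Tensor:
    """All local patches Ψ(F^l(·)) of a (C, H, W) feature map, shape (m, C, size, size)."""
    patches = feat.unsqueeze(0).unfold(2, size, stride).unfold(3, size, stride)
    return patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, feat.shape[0], size, size)

def mrf_style_loss(feat_I: torch.Tensor, feat_Is: torch.Tensor, size: int = 3, stride: int = 1):
    """Patch-based MRF prior of Equation (6) for a single layer."""
    p_I = extract_patches(feat_I, size, stride)      # Ψ_i(F^l(I)), i = 1..m
    p_Is = extract_patches(feat_Is, size, stride)    # candidate style patches
    # Cross-correlation of every content location with every style patch; only the style
    # patches need normalising, since the content-patch norm does not change the argmax.
    norms = p_Is.reshape(p_Is.shape[0], -1).norm(dim=1).view(-1, 1, 1, 1) + 1e-8
    ncc = F.conv2d(feat_I.unsqueeze(0), p_Is / norms, stride=stride)   # (1, m_s, H', W')
    nn_idx = ncc.reshape(ncc.shape[1], -1).argmax(dim=0)               # NN(i) for each content patch
    return ((p_I - p_Is[nn_idx]) ** 2).sum()
```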
The advantage of the algorithm of Li and Wand is that it performs especially well for photorealistic styles, or more specifically, when the content photo and the style are similar in shape and perspective, due to the patch-based MRF loss. However, it generally fails when the content and style images have strong differences in perspective and structure, since the image patches cannot be correctly matched. It is also limited in preserving sharp details and depth information.

4.2 Model-Optimisation-Based Offline Neural Methods

Although IOB-NST is able to yield impressive stylised images, there are still some limitations. The most concerning limitation is the efficiency issue. The second category, MOB-NST, addresses the speed and computational cost issue by exploiting MOB-IR to reconstruct the stylised result, i.e., a feed-forward network g is optimised over a large set of images I_c for one or more style images I_s:

  θ* = arg min_θ L_total(I_c, I_s, g_θ(I_c)),   I* = g_{θ*}(I_c).   (7)

Depending on the number of artistic styles a single g can produce, MOB-NST algorithms are further divided into Per-Style-Per-Model (PSPM) MOB-NST methods, Multiple-Style-Per-Model (MSPM) MOB-NST methods, and Arbitrary-Style-Per-Model (ASPM) MOB-NST methods.

4.2.1 Per-Style-Per-Model Neural Methods

1) Parametric PSPM with Summary Statistics. The first two MOB-NST algorithms are proposed by Johnson et al. [47] and Ulyanov et al. [48], respectively. These two methods share a similar idea, which is to pre-train a feed-forward style-specific network and produce a stylised result with a single forward pass at the testing stage. They only differ in their network architectures: Johnson et al.'s design roughly follows the network proposed by Radford et al. [49] but with residual blocks and fractionally strided convolutions, while Ulyanov et al. use a multi-scale architecture as the generator network. The objective function is similar to that of the algorithm of Gatys et al. [4], which indicates that they are also Parametric Methods with Summary Statistics.

The algorithms of Johnson et al. and Ulyanov et al. achieve real-time style transfer. However, their algorithm design basically follows the algorithm of Gatys et al., which makes them suffer from the same aforementioned issues as Gatys et al.'s algorithm (e.g., a lack of consideration of the coherence of details and depth information).

Shortly after [47], [48], Ulyanov et al. [50] further find that simply applying normalisation to every single image rather than to a batch of images (precisely, batch normalisation (BN)) leads to a significant improvement in stylisation quality. This single image normalisation is called instance normalisation (IN), which is equivalent to batch normalisation when the batch size is set to 1. The style transfer network with IN is shown to converge faster than with BN and also achieves visually better results. One interpretation is that IN is a form of style normalisation and can directly normalise the style of each content image to the desired style [51]. Therefore, the objective is easier to learn, as the rest of the network only needs to take care of the content loss.

2) Non-parametric PSPM with MRFs. Another work by Li and Wand [52] is inspired by the MRF-based NST algorithm [46] in Section 4.1.2. They address the efficiency issue by training a Markovian feed-forward network using adversarial training. Similar to [46], their algorithm is a patch-based non-parametric method with MRFs. Their method is shown to outperform the algorithms of Johnson et al. and Ulyanov et al. in the preservation of coherent textures in complex images, thanks to its patch-based design. However, their algorithm has a less satisfying performance with non-texture styles (e.g., face images), since it lacks a consideration of semantics. Other weaknesses of their algorithm include a lack of consideration of depth information and variations of brush strokes, which are important visual factors.

4.2.2 Multiple-Style-Per-Model Neural Methods

Although the above PSPM approaches can produce stylised images two orders of magnitude faster than previous IOB-NST methods, separate generative networks have to be trained for each particular style image, which is quite time-consuming and inflexible. But many paintings (e.g., impressionist paintings) share similar paint strokes and only differ in their colour palettes. Intuitively, it is redundant to train a separate network for each of them. MSPM is therefore proposed, which improves the flexibility of PSPM by further incorporating multiple styles into one single model. There are generally two paths towards handling this problem: 1) tying only a small number of parameters in a network to each style ([53], [54]) and 2) still exploiting only a single network like PSPM but combining both style and content as inputs ([55], [56]).

1) Tying only a small number of parameters to each style. An early work by Dumoulin et al. [53] is built on the basis of the IN layer proposed in the PSPM algorithm [50] (Section 4.2.1). They surprisingly find that using the same convolutional parameters but only different scaling and shifting parameters in the IN layers is sufficient to model different styles. Therefore, they propose an algorithm to train a conditional multi-style transfer network based on conditional instance normalisation (CIN), which is defined as:

  CIN(F(I_c), s) = γ^s ( (F(I_c) − µ(F(I_c))) / σ(F(I_c)) ) + β^s ,   (8)

where F is the input feature activation and s is the index of the desired style from a set of style images. As shown in Equation (8), the conditioning for each style I_s is done by the scaling and shifting parameters γ^s and β^s after normalising the feature activation F(I_c), i.e., each style I_s can be achieved by tuning the parameters of an affine transformation. The interpretation is similar to that for [50] in Section 4.2.1, i.e., the normalisation of feature statistics with different affine parameters can normalise the input content image to different styles. Furthermore, the algorithm of Dumoulin et al. can also be extended to combine multiple styles in a single stylised result by combining the affine parameters of different styles.
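A sketch of conditional instance normalisation as in Equation (8) (ours, not the reference implementation of [53]); it simply keeps one pair of affine parameters per style and applies them after normalising each channel of the content features:

```python
import torch
import torch.nn as nn

class ConditionalInstanceNorm(nn.Module):
    """CIN(F(I_c), s) = γ^s · (F(I_c) − µ(F(I_c))) / σ(F(I_c)) + β^s   (Equation 8)."""
    def __init__(self, num_channels: int, num_styles: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_styles, num_channels))   # per-style scale γ^s
        self.beta = nn.Parameter(torch.zeros(num_styles, num_channels))   # per-style shift β^s
        self.eps = eps

    def forward(self, x: torch.Tensor, style_idx: int) -> torch.Tensor:
        # x: (B, C, H, W) content feature activations F(I_c)
        mu = x.mean(dim=(2, 3), keepdim=True)
        sigma = x.std(dim=(2, 3), keepdim=True)
        normed = (x - mu) / (sigma + self.eps)
        g = self.gamma[style_idx].view(1, -1, 1, 1)
        b = self.beta[style_idx].view(1, -1, 1, 1)
        return g * normed + b
```

Setting num_styles to 1 recovers plain instance normalisation, and mixing the affine parameters of several styles gives the style-fusion behaviour described above.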
Another algorithm which follows the first path of MSPM is proposed by Chen et al. [54]. Their idea is to explicitly decouple style and content, i.e., to use separate network components to learn the corresponding content and style information. More specifically, they use mid-level convolutional filters (a "StyleBank" layer) to individually learn different styles. Each style is tied to a set of parameters in the "StyleBank" layer. The remaining components in the network are used to learn content information, which is shared by different styles. Their algorithm also supports flexible incremental training, which is to fix the content components in the network and only train a "StyleBank" layer for a new style.

In summary, both the algorithms of Dumoulin et al. and Chen et al. have the benefits of requiring little effort to learn a new style and of flexible control over style fusion. However, they do not address the common limitations of NST algorithms, e.g., a lack of details, semantics, depth and variations in brush strokes.

2) Combining both style and content as inputs. One disadvantage of the first category is that the model size generally becomes larger with the increase of the number of learned styles. The second path of MSPM addresses this limitation by fully exploring the capability of one single network and combining both content and style into the network for style identification. Different MSPM algorithms differ in the way they incorporate style into the network.

In [55], given N target styles, Li et al. design a selection unit for style selection, which is an N-dimensional one-hot vector. Each bit in the selection unit represents a specific style I_s in the set of target styles. For each bit in the selection unit, Li et al. first sample a corresponding noise map f(I_s) from a uniform distribution and then feed f(I_s) into the style sub-network to obtain the corresponding style encoded features F(f(I_s)). By feeding the concatenation of the style encoded features F(f(I_s)) and the content encoded features Enc(I_c) into the decoder part Dec of the style transfer network, the desired stylised result can be produced: I = Dec( F(f(I_s)) ⊕ Enc(I_c) ).

Another work, by Zhang and Dana [56], first forwards each style image in the style set through the pre-trained VGG network and obtains multi-scale feature activations F(I_s) in different VGG layers. Then the multi-scale F(I_s) are combined with multi-scale encoded features Enc(I_c) from different layers in the encoder through their proposed inspiration layers. The inspiration layers are designed to reshape F(I_s) to match the desired dimension, and also have a learnable weight matrix to tune the feature maps to help minimise the objective function.

The second type of MSPM addresses the limitation of the increased model size in the first type of MSPM. As a trade-off, the style scalability of the second type of MSPM is much smaller, since only one single network is used for multiple styles. We will quantitatively compare the style scalability of different MSPM algorithms in Section 6. In addition, some aforementioned limitations of the first type of MSPM still exist, i.e., the second type of MSPM algorithms are still limited in preserving the coherence of fine structures and also depth information.

4.2.3 Arbitrary-Style-Per-Model Neural Methods

The third category, ASPM-MOB-NST, aims at one-model-for-all, i.e., one single trainable model to transfer arbitrary artistic styles. There are also two types of ASPM, one built upon Non-parametric Texture Modelling with MRFs and the other built upon Parametric Texture Modelling with Summary Statistics.

1) Non-parametric ASPM with MRFs. The first ASPM algorithm is proposed by Chen and Schmidt [57]. They first extract a set of activation patches from the content and style feature activations computed with a pre-trained VGG network. Then they match each content patch to the most similar style patch and swap them (called "Style Swap" in [57]). The stylised result can be produced by reconstructing the resulting activation map after "Style Swap", with either IOB-IR or MOB-IR techniques. The algorithm of Chen and Schmidt is more flexible than the previous approaches due to its one-model-for-all-styles characteristic. But the stylised results of [57] are less appealing, since the content patches are typically swapped with style patches which are not representative of the desired style. As a result, the content is well preserved while the style is generally not well reflected.

2) Parametric ASPM with Summary Statistics. Considering [53] in Section 4.2.2, the simplest approach to arbitrary style transfer is to train a separate parameter prediction network P to predict γ^s and β^s in Equation (8) with a number of training styles [58]. Given a test style image I_s, the CIN layers in the style transfer network take the affine parameters γ^s and β^s from P(I_s), and normalise the input content image to the desired style with a forward pass.

Another similar approach based on [53] is proposed by Huang and Belongie [51]. Instead of training a parameter prediction network, Huang and Belongie propose to modify conditional instance normalisation (CIN) in Equation (8) to adaptive instance normalisation (AdaIN):

  AdaIN(F(I_c), F(I_s)) = σ(F(I_s)) ( (F(I_c) − µ(F(I_c))) / σ(F(I_c)) ) + µ(F(I_s)).   (9)

AdaIN transfers the channel-wise mean and variance feature statistics between the content and style feature activations, which also shares a similar idea with [57]. Different from [53], the encoder in the style transfer network of [51] is fixed and comprises the first few layers of a pre-trained VGG network. Therefore, F in [51] is the feature activation of a pre-trained VGG network. The decoder part needs to be trained with a large set of style and content images to decode the resulting feature activations after AdaIN into the stylised result: I = Dec( AdaIN(F(I_c), F(I_s)) ).

The algorithm of Huang and Belongie [51] is the first ASPM algorithm that achieves real-time stylisation. However, the algorithm of Huang and Belongie [51] is data-driven and limited in generalising to unseen styles. Also, simply adjusting the mean and variance of the feature statistics makes it hard to synthesise complicated style patterns with rich details and local structures.
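Since AdaIN in Equation (9) is just a re-parameterisation of the content features with the style's channel-wise statistics, it can be written in a few lines (our sketch; the complete method of [51] additionally requires the fixed VGG encoder and a trained decoder, and the small eps term is our addition for numerical safety):

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """AdaIN(F(I_c), F(I_s)) of Equation (9); inputs are (B, C, H, W) VGG activations."""
    c_mu = content_feat.mean(dim=(2, 3), keepdim=True)
    c_sigma = content_feat.std(dim=(2, 3), keepdim=True)
    s_mu = style_feat.mean(dim=(2, 3), keepdim=True)
    s_sigma = style_feat.std(dim=(2, 3), keepdim=True)
    normalised = (content_feat - c_mu) / (c_sigma + eps)
    return s_sigma * normalised + s_mu   # stylised features, to be decoded: I = Dec(AdaIN(...))
```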
A more recent work by Li et al. [59] attempts to exploit a series of feature transformations to transfer arbitrary artistic styles in a learning-free manner. Similar to [51], Li et al. use the first few layers of pre-trained VGG as the encoder and train the corresponding decoder. But they replace the AdaIN layer [51] between the encoder and decoder with a pair of whitening and colouring transformations (WCT): I = Dec( WCT(F(I_c), F(I_s)) ). Their algorithm is built on the observation that the whitening transformation can remove the style-related information and preserve the structure of the content. Therefore, receiving the content activations F(I_c) from the encoder, the whitening transformation can filter the original style out of the input content image and return a filtered representation with only content information. Then, by applying the colouring transformation, the style patterns contained in F(I_s) are incorporated into the filtered content representation, and the stylised result I can be obtained by decoding the transformed features. They also extend this single-level stylisation to multi-level stylisation to further improve visual quality.

The algorithm of Li et al. is the first ASPM algorithm to transfer artistic styles in a learning-free manner. Therefore, compared with [51], it does not have the limitation in generalisation capabilities. But the algorithm of Li et al. is still not effective at producing sharp details and fine strokes. The stylisation results will be shown in Section 6. Also, it lacks a consideration of preserving depth information and variations in brush strokes.

5 IMPROVEMENTS AND EXTENSIONS

Since the emergence of NST algorithms, there has also been research devoted to improving current NST algorithms by controlling perceptual factors (e.g., stroke size control, spatial style control, and colour control) (Figure 2, green boxes). Also, all of the aforementioned NST methods are designed for general still images. They may not be appropriate for specialised types of images and videos (e.g., doodles, head portraits, and video frames). Thus, a variety of follow-up studies (Figure 2, pink boxes) aim to extend general NST algorithms to these particular types of images and even extend them beyond artistic image style (e.g., to audio style).

Controlling Perceptual Factors in Neural Style Transfer. Gatys et al. themselves [60] propose several slight modifications to improve their previous algorithm [4]. They demonstrate a spatial style control strategy to control the style in each region of the content image. Their idea is to define guidance channels for the feature activations of both the content and style image. The guidance channel has values in [0, 1] specifying which style should be transferred to which content region, i.e., the content regions where the content guidance channel is 1 should be rendered with the style where the style guidance channel is equal to 1. As for colour control, the original NST algorithm produces stylised images with the colour distribution of the style image. However, sometimes people prefer a colour-preserving style transfer, i.e., preserving the colour of the content image during style transfer. The corresponding solution is to first transform the style image's colours to match the content image's colours before style transfer, or alternatively to perform style transfer only in the luminance channel.

Figure 3: Controlling the brush stroke size in NST: (a) content, (b) style, (c) output with a smaller brush size and (d) output with a larger brush size. The style image is "The Starry Night" by Vincent van Gogh.

For stroke size control, the problem is much more complex. We show sample results of stroke size control in Figure 3. The discussion of stroke size control strategies needs to be split into several cases [61]:

1) IOB-NST with non-high-resolution images: Since current style statistics (e.g., Gram-based and BN-based statistics) are scale-sensitive [61], to achieve different stroke sizes, the solution is simply to resize a given style image to different scales.

2) MOB-NST with non-high-resolution images: One possible solution is to resize the input image to different scales before the forward pass, which inevitably hurts stylisation quality. Another possible solution is to train multiple models with different scales of a style image, which is space and time consuming. Also, this solution fails to preserve stroke consistency among results with different stroke sizes, i.e., the results vary in stroke orientations, stroke configurations, etc. However, users generally desire to change only the stroke size and nothing else. To address this problem, Jing et al. [61] propose a stroke-controllable PSPM algorithm. The core component of their algorithm is a StrokePyramid module, which learns different stroke sizes with adaptive receptive fields. Without trading off quality and speed, their algorithm is the first to exploit one single model to achieve flexible continuous stroke size control while preserving stroke consistency, and further achieves spatial stroke size control to produce new artistic effects. Although one can also use an ASPM algorithm to control stroke size, ASPM trades off quality and speed. As a result, ASPM is not effective at producing fine strokes and details compared with [61].

3) IOB-NST with high-resolution images: For high-resolution images (e.g., 3000 × 3000 pixels in [60]), a large stroke size cannot be achieved by simply resizing the style image to a large scale. Since only the region in the content image with the receptive field size of VGG can be affected by a neuron in the loss network, there is almost no visual difference between a large and a larger brush stroke in a small image region of receptive-field size. Gatys et al. [60] tackle this problem by proposing a coarse-to-fine IOB-NST procedure with several steps of downsampling, stylising, upsampling and final stylising.

4) MOB-NST with high-resolution images: Similar to 3), the stroke size in the stylised result does not vary with the style image scale for high-resolution images. The solution is also similar to Gatys et al.'s algorithm in [60], namely a coarse-to-fine stylisation procedure [62]. The idea is to exploit a multimodel, which comprises multiple subnetworks. Each subnetwork receives the upsampled stylised result of the previous subnetwork as input, and stylises it again with finer strokes.

Another limitation of current NST algorithms is that they do not consider the depth information contained in the image. To address this limitation, the depth-preserving NST algorithm [63] is proposed. Their approach is to add a depth loss function based on [47] to measure the depth difference between the content image and the stylised image. The image depth is acquired by applying a single-image depth estimation algorithm (e.g., Chen et al.'s work in [64]).
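As an illustration of how such a depth term can be attached to a feed-forward NST objective (our sketch, not the implementation of [63]; `depth_net` stands for any differentiable single-image depth estimator and is an assumption on our part):

```python
import torch

def depth_loss(content_img: torch.Tensor, stylised_img: torch.Tensor, depth_net) -> torch.Tensor:
    """Penalise the difference between the estimated depth maps of the content image
    and the stylised result (both given as (B, 3, H, W) tensors)."""
    with torch.no_grad():
        target_depth = depth_net(content_img)   # depth of the content image is a fixed target
    pred_depth = depth_net(stylised_img)        # gradients flow back into the transfer network
    return ((pred_depth - target_depth) ** 2).mean()

# Schematic training objective of a depth-preserving transfer network g:
# loss = alpha * content_loss + beta * style_loss + gamma * depth_loss(I_c, g(I_c), depth_net)
```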
Semantic Style Transfer. Given a pair of style and content images which are similar in content, the goal of semantic style transfer is to build a semantic correspondence between the style and content, which maps each style region to a corresponding semantically similar content region. Then the style in each style region is transferred to the semantically similar content region.

1) Image-Optimisation-Based Semantic Style Transfer. Since the patch matching scheme naturally meets the requirements of region-based correspondence, Champandard [65] proposes to build a semantic style transfer algorithm based on the aforementioned patch-based algorithm [46] (Section 4.1.2). Although the result produced by the algorithm of Li and Wand [46] is close to the target of semantic style transfer, [46] does not incorporate an accurate segmentation mask, which sometimes leads to a wrong semantic match. Therefore, Champandard augments [46] with an additional semantic channel, which is a downsampled semantic segmentation map. The segmentation map can be either manually annotated or produced by a semantic segmentation algorithm [66], [67]. Despite the effectiveness of [65], an MRF-based design is not the only choice. Instead of combining an MRF prior, Chen and Hsu [68] provide an alternative way to achieve semantic style transfer, which is to exploit a masking process to constrain the spatial correspondence and also a higher order style feature statistic to further improve the result. More recently, Mechrez et al. [69] propose an alternative contextual loss to realise semantic style transfer in a segmentation-free manner.

2) Model-Optimisation-Based Semantic Style Transfer. As before, efficiency is always a big concern. Both [65] and [68] are based on IOB-NST algorithms and therefore leave much room for improvement. Lu et al. [70] speed up the process by optimising the objective function in feature space, instead of in pixel space. More specifically, they propose to do feature reconstruction, instead of image reconstruction as previous algorithms do. This optimisation strategy reduces the computational burden, since the loss does not need to propagate through a deep network. The resulting reconstructed feature is decoded into the final result with a trained decoder. Since the speed of [70] does not reach real-time, there is still much room for further research.

Instance Style Transfer. Instance style transfer is built on instance segmentation and aims to stylise only a single user-specified object within an image. The challenge mainly lies in the transition between the stylised object and the non-stylised background. Castillo et al. [71] tackle this problem by adding an extra MRF-based loss to smooth and anti-alias boundary pixels.

Doodle Style Transfer. An interesting extension can be found in [65], which is to exploit NST to transform rough sketches into fine artworks. The method simply discards the content loss term and uses doodles as the segmentation map to do semantic style transfer.

Stereoscopic Style Transfer. Driven by the demand of AR/VR, Chen et al. [72] propose a stereoscopic NST algorithm for stereoscopic images. They propose a disparity loss to penalise the bidirectional disparity. Their algorithm is shown to produce more consistent strokes for different views.

Portrait Style Transfer. Current style transfer algorithms are usually not optimised for head portraits. As they do not impose spatial constraints, directly applying these existing algorithms to head portraits will deform facial structures, which is unacceptable to the human visual system. Selim et al. [73] address this problem and extend [4] to head portrait painting transfer. They propose to use the notion of gain maps to constrain spatial configurations, which can preserve the facial structures while transferring the texture of the style image.

Video Style Transfer. NST algorithms for video sequences were proposed shortly after Gatys et al.'s first NST algorithm for still images [4]. Different from still image style transfer, the design of a video style transfer algorithm needs to consider the smooth transition between adjacent video frames. As before, we divide the related algorithms into Image-Optimisation-Based and Model-Optimisation-Based Video Style Transfer.

1) Image-Optimisation-Based Online Video Style Transfer. The first video style transfer algorithm is proposed by Ruder et al. [74], [75]. They introduce a temporal consistency loss based on optical flow to penalise deviations along point trajectories. The optical flow is calculated using novel optical flow estimation algorithms [76], [77]. As a result, their algorithm eliminates temporal artefacts and produces smooth stylised videos. However, they build their algorithm upon [4] and need several minutes to process a single frame.

2) Model-Optimisation-Based Offline Video Style Transfer. Several follow-up studies are devoted to stylising a given video in real-time. Huang et al. [78] propose to augment Ruder et al.'s temporal consistency loss [74] upon current PSPM algorithms. Given two consecutive frames, the temporal consistency loss is directly computed using the two corresponding outputs of the style transfer network to encourage pixel-wise consistency, and a corresponding two-frame synergic training strategy is introduced for the computation of the temporal consistency loss. Another concurrent work which shares a similar idea with [78] but with an additional exploration of the style instability problem can be found in [79]. Different from [78], [79], Chen et al. [80] propose a flow subnetwork to produce feature flow and incorporate optical flow information in feature space. Their algorithm is built on a pre-trained style transfer network (an encoder-decoder pair) and warps feature activations from the pre-trained stylisation encoder using the obtained feature flow.
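A common way to express such a temporal consistency term (our sketch; it assumes the optical flow and an occlusion mask are provided by an external flow estimator, as in the methods above) is to warp the previous stylised frame with the flow and penalise per-pixel deviations in non-occluded regions:

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a (B, C, H, W) frame with a (B, 2, H, W) flow field given in pixels."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                                       # sampling positions
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                # normalise to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)               # (B, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

def temporal_loss(stylised_t, stylised_prev, flow, occlusion_mask):
    """Penalise deviation from the flow-warped previous stylised frame (mask = 1 where flow is valid)."""
    warped_prev = warp(stylised_prev, flow)
    return ((occlusion_mask * (stylised_t - warped_prev)) ** 2).mean()
```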
Character Style Transfer. Given a style image containing multiple characters, the goal of Character Style Transfer is to apply the idea of NST to generate new fonts and text effects. In [81], Atarsaikhan et al. directly apply the algorithm in [4] to font style transfer and achieve visually plausible results. Yang et al. [82] propose to first characterise style elements and exploit the extracted characteristics to guide the generation of text effects. A more recent work [83] designs a conditional GAN model for glyph shape prediction, and also an ornamentation network for colour and texture prediction. By training these two networks jointly, font style transfer can be realised in an end-to-end manner.

Photorealistic Style Transfer. Photorealistic style transfer (also known as colour style transfer) aims to transfer the style of colour distributions. The general idea is to build upon current semantic style transfer but to eliminate distortions and preserve the original structure of the content image.

1) Image-Optimisation-Based Photorealistic Style Transfer. The earliest photorealistic style transfer approach is proposed by Luan et al. [84]. They propose a two-stage optimisation procedure, which is to initialise the optimisation by stylising a given photo with a non-photorealistic style transfer algorithm [65] and then penalise image distortions by adding a photorealism regularisation. But since Luan et al.'s algorithm is built on the Image-Optimisation-Based Semantic Style Transfer method [65], their algorithm is computationally expensive. Similar to [84], another algorithm proposed by Mechrez et al. [85] also adopts a two-stage optimisation procedure. They propose to refine the non-photorealistic stylised result by matching the gradients in the output image to those in the content photo. Compared to [84], the algorithm of Mechrez et al. achieves a faster photorealistic stylisation speed.

2) Model-Optimisation-Based Photorealistic Style Transfer. Li et al. [86] address the efficiency issue of [84] by handling the problem in two steps, a stylisation step and a smoothing step. The stylisation step applies the NST algorithm in [59] but replaces upsampling layers with unpooling layers to produce the stylised result with fewer distortions. The smoothing step then further eliminates structural artefacts. These two aforementioned algorithms [84], [86] are mainly designed for natural images. Another work [87] proposes to exploit a GAN to transfer the colour from human-designed anime images to sketches. Their algorithm demonstrates a promising application of photorealistic style transfer, namely automatic image colourisation.

Attribute Style Transfer. Image attributes generally refer to image colours, textures, etc. Previously, image attribute transfer was accomplished through image analogy [9] in a supervised manner (Section 2). Derived from the idea of patch-based NST [46], Liao et al. [88] propose a deep image analogy to study image analogy in the domain of CNN features. Their algorithm is based on a patch matching technique and realises a weakly supervised image analogy, i.e., their algorithm only needs a single pair of source and target images instead of a large training set.

Fashion Style Transfer. Fashion style transfer receives a fashion style image as the target and generates clothing images with the desired fashion style. The challenge of fashion style transfer lies in preserving a similar design to the basic input clothing while blending in the desired style patterns. This idea is first proposed by Jiang and Fu [89]. They tackle this problem by proposing a pair of fashion style generator and discriminator.

Audio Style Transfer. In addition to transferring image styles, [90], [91] extend the domain of image style to audio style, and synthesise new sounds by transferring the desired style from a target audio. The study of audio style transfer also follows the route of image style transfer, i.e., Audio-Optimisation-Based Online Audio Style Transfer and Model-Optimisation-Based Offline Audio Style Transfer.

Figure 5: Some example results of IOB-NST and PSPM-MOB-NST for qualitative evaluation. The content images are from the benchmark dataset proposed by Mould and Rosin [92], [93]. The style images are in the public domain. Detailed information about our style images can be found in Table 1.

Figure 6: Saliency detection results of IOB-NST and PSPM-MOB-NST, corresponding to Figure 5. The results are produced using the discriminative regional feature integration approach proposed by Wang et al. [94].

Figure 7: Some example results of MSPM-MOB-NST for qualitative evaluation. The content images are from the benchmark dataset proposed by Mould and Rosin [92], [93]. The style images are in the public domain. Detailed information about our style images can be found in Table 1.

Figure 8: Saliency detection results of MSPM-MOB-NST, corresponding to Figure 7. The results are produced using the discriminative regional feature integration approach proposed by Wang et al. [94].

Figure 9: Some example results of ASPM-MOB-NST for qualitative evaluation. The content images are from the benchmark dataset proposed by Mould and Rosin [92], [93]. The style images are in the public domain. Detailed information about our style images can be found in Table 1.

Figure 10: Saliency detection results of ASPM-MOB-NST, corresponding to Figure 9. The results are produced using the discriminative regional feature integration approach proposed by Wang et al. [94].
…material³.

3. https://www.dropbox.com/s/5xd8iizoigvjcxz/SupplementaryMaterial neuralStyleReview.pdf?dl=0

1) Results of IOB-NST. Following the content and style images, Figure 5 contains the results of Gatys et al.'s IOB-NST algorithm based on online image optimisation [4]. The style transfer process is computationally expensive, but in contrast, the results are appealing in visual quality. Therefore, the algorithm of Gatys et al. is usually regarded as the gold-standard method in the NST community.

2) Results of PSPM-MOB-NST. Figure 5 shows the results of Per-Style-Per-Model MOB-NST algorithms (Section 4.2). Each model only fits one style. It can be noticed that the stylised results of Ulyanov et al. [48] and Johnson et al. [47] are somewhat similar. This is not surprising, since they share a similar idea and only differ in their detailed network architectures. For Li and Wand [52], the results are slightly less impressive. Since [52] is based on a Generative Adversarial Network (GAN), the training process is, to some extent, not that stable. But we believe that GAN-based style transfer is a very promising direction, and there are already some other GAN-based works [83], [87], [98] (Section 5) in the field of NST.

3) Results of MSPM-MOB-NST. Figure 7 demonstrates the results of Multiple-Style-Per-Model MOB-NST algorithms. Multiple styles are incorporated into a single model. The idea of both Dumoulin et al.'s algorithm [53] and Chen et al.'s algorithm [54] is to tie a small number of parameters to each style. Also, both of them build their algorithm upon the architecture of [47]. Therefore, it is not surprising that their results are visually similar. Although the results of [53], [54] are appealing, their model size will become larger with the increase of the number of learned styles. In contrast, Zhang and Dana's algorithm [56] and Li et al.'s algorithm [55] use a single network with the same trainable network weights for multiple styles. The model size issue is tackled, but there seem to be some interferences among different styles, which slightly influences the stylisation quality.

4) Results of ASPM-MOB-NST. Figure 9 presents the last category of MOB-NST algorithms, namely Arbitrary-Style-Per-Model MOB-NST algorithms. Their idea is one-model-for-all. Globally, the results of ASPM are slightly less impressive than those of the other types of algorithms. This is acceptable in that a three-way trade-off between speed, flexibility and quality is common in research. Chen and Schmidt's patch-based algorithm [57] seems to not combine enough style elements into the content image. Their algorithm is based on similar-patch swapping. When lots of content patches are swapped with style patches that do not contain enough style elements, the target style will not be reflected well. Ghiasi et al.'s algorithm [58] is data-driven and its stylisation quality is very dependent on the variety of training styles. For the algorithm of Huang and Belongie [51], they propose to match global summary feature statistics and successfully improve the visual quality compared with [57]. However, their algorithm seems not good at handling complex style patterns, and its stylisation quality is still related to the variety of training styles. The algorithm of Li et al. [59] replaces the training process with a series of transformations. But [59] is not effective at producing sharp details and fine strokes.

Saliency Comparison. NST is an art creation process. As indicated in [3], [38], [39], the definition of style is subjective and also very complex, involving personal preferences, texture compositions, as well as the tools and medium used. As a result, it is difficult to define an aesthetic criterion for a stylised artwork. For the same stylised result, different people may have different or even opposite views. Nevertheless, our goal is to compare the results of different NST techniques (shown in Figure 5, Figure 7 and Figure 9) as objectively as possible. Here, we consider comparing saliency maps, as proposed in [63]. The corresponding results are shown in Figure 6, Figure 8 and Figure 10. Saliency maps can demonstrate visually dominant locations in images. Intuitively, a successful style transfer could weaken or enhance the saliency maps in content images, but should not change their integrity and coherence. From Figure 6 (saliency detection results of IOB-NST and PSPM-MOB-NST), it can be noticed that the stylised results of [4], [47], [48] preserve the structures of the content images well; however, for [52], it might be harder for an observer to recognise the objects after stylisation. Using a similar analytical method, from Figure 8 (saliency detection results of MSPM-MOB-NST), [53] and [54] preserve similar saliency to the original content images, since they both tie a small number of parameters to each style. [56] and [55] are also similar regarding the ability to retain the integrity of the original saliency maps, because they both use a single network for all styles. As shown in Figure 10, for the saliency detection results of ASPM-MOB-NST, [58] and [51] perform better than [57] and [59]; however, both [58] and [51] are data-driven methods and their quality depends on the diversity of training styles. In general, it seems that the results of MSPM-MOB-NST preserve better saliency coherence than ASPM-MOB-NST, but are a little inferior to IOB-NST and PSPM-MOB-NST.

6.3 Quantitative Evaluation

Regarding the quantitative evaluation, we mainly focus on five evaluation metrics: the generation time for a single content image of different sizes; the training time for a single model; the average loss for content images, to measure how well the loss function is minimised; the loss variation during training, to measure how fast the model converges; and the style scalability, to measure how large the learned style set can be.

1) Stylisation speed. The issue of efficiency is the focus of MOB-NST algorithms. In this subsection, we compare different algorithms quantitatively in terms of stylisation speed. Table 2 demonstrates the average time to stylise one image at three resolutions using different algorithms. In our experiment, the style images have the same size as the content images. The fifth column in Table 2 represents the number of styles one model of each algorithm can produce. k (k ∈ Z+) denotes that a single model can produce multiple styles, which corresponds to MSPM algorithms. ∞ means that a single model works for any style, which corresponds to ASPM algorithms. The numbers reported in Table 2 are obtained by averaging the generating time over 100 images.
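For reproducibility, this kind of timing is typically measured as in the sketch below (ours, assuming a CUDA-capable GPU and a feed-forward transfer network `model`); the synchronisation calls matter because GPU kernels are launched asynchronously.

```python
import time
import torch

def average_stylisation_time(model, images, device="cuda"):
    """Average per-image forward-pass time of a feed-forward NST model over a set of images."""
    model = model.to(device).eval()
    with torch.no_grad():
        model(images[0].unsqueeze(0).to(device))   # warm-up pass, excluded from the measurement
        torch.cuda.synchronize()
        start = time.time()
        for img in images:
            model(img.unsqueeze(0).to(device))
        torch.cuda.synchronize()
    return (time.time() - start) / len(images)
```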
Table 2: Average speed comparison of NST algorithms for images of size 256 × 256 pixels, 512 × 512 pixels and 1024 × 1024 pixels (on an NVIDIA Quadro M6000).
Note: The fifth column shows the number of styles that a single model can produce. Time both excluding (outside parentheses) and including (inside parentheses) the style encoding process is shown, since [56], [57] and [51] support storing encoded style statistics in advance to further speed up the stylisation process for the same style but different content images. The time of [57] for producing 1024 × 1024 images is not shown due to the memory limitation. The speeds of [53], [58] are similar to [47] since they share a similar architecture; we do not redundantly list them in this table.

Table 3: A summary of the advantages and disadvantages of the algorithms covered in our experiment.
Note: E, AS, LF, and VQ represent Efficient, Arbitrary Style, Learning-Free, and Visual Quality, respectively. IOB-NST denotes the category Image-Optimisation-Based Neural Style Transfer and MOB-NST represents Model-Optimisation-Based Neural Style Transfer.

Note that we do not include the speed of [53], [58] in Table 2, as their algorithm is to scale and shift parameters based on the algorithm of Johnson et al. [47]. The time required to stylise one image using [53], [58] is very close to that of [47] under the same setting. For Chen et al.'s algorithm in [54], since their algorithm is protected by a patent and they do not make the detailed architecture design public, here we just attach the speed information provided by the authors for reference: on a Pascal Titan X GPU, 256 × 256: 0.007s; 512 × 512: 0.024s; 1024 × 1024: 0.089s. For Chen and Schmidt's algorithm [57], the time for processing a 1024 × 1024 image is not reported due to the limit of video memory. Swapping patches for two 1024 × 1024 images needs more than 24 GB of video memory and thus the stylisation process is not practical. We can observe that, except for [57], [59], all the other MOB-NST algorithms are capable of stylising even high-resolution content images in real-time. ASPM algorithms are generally slower than PSPM and MSPM, which demonstrates the aforementioned three-way trade-off again.

2) Training time. Another concern is the training time for a single model. The training time of different algorithms is hard to compare, as sometimes a model trained with just a few iterations is capable of producing sufficiently visually appealing results. So we just outline our training time of different algorithms (under the same setting) as a reference for follow-up studies. On an NVIDIA Quadro M6000, the training time for a single model is about 3.5 hours for the algorithm of Johnson et al. [47], 3 hours for the algorithm of Ulyanov et al. [48], 2 hours for the algorithm of Li and Wand [52], 4 hours for Zhang and Dana [56], and 8 hours for Li et al. [55]. Chen and Schmidt's algorithm [57] and Huang and Belongie's algorithm [51] take much longer (e.g., a couple of days), which is acceptable since a pre-trained model can work for any style. The training time of [58] depends on how large the training style set is. For MSPM algorithms, the training time can be further reduced through incremental learning over a pre-trained model. For example, the algorithm of Chen et al. only needs 8 minutes to incrementally learn a new style, as reported in [54].

3) Loss comparison. One way to evaluate some MOB-NST algorithms which share the same loss function is to compare their loss variation during training, i.e., a training curve comparison. It helps researchers to justify the
Figure 11: Training curves of the total loss, style loss and content loss (ln(L_total), ln(L_s) and ln(L_c) versus training iterations) for different algorithms. Solid curves represent the loss variation of the algorithm of Ulyanov et al. [48], while the dashed curves represent the algorithm of Johnson et al. [47]. Different colours correspond to different randomly selected styles from our style set.
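A comparison like Figure 11 only requires logging the three loss terms during training and plotting them on a log scale; a minimal sketch (ours) is given below, where `history` is assumed to be a dict of per-iteration loss values collected in the training loop.

```python
import math
import matplotlib.pyplot as plt

def plot_training_curves(history, label):
    """history: {"total": [...], "style": [...], "content": [...]} recorded once per iteration."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, key in zip(axes, ("total", "style", "content")):
        ax.plot([math.log(v) for v in history[key]], label=label)
        ax.set_xlabel("Iterations")
        ax.set_ylabel(f"ln(L_{key})")
        ax.legend()
    fig.tight_layout()
    return fig
```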