
PHYSICAL REVIEW X 14, 031001 (2024)

How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model
Francesco Cagnetta,1,*,† Leonardo Petrini,1,* Umberto M. Tomasini,1 Alessandro Favero,1,2 and Matthieu Wyart1,‡

1 Institute of Physics, EPFL, Lausanne, Switzerland
2 Institute of Electrical Engineering, EPFL, Lausanne, Switzerland

* These authors contributed equally to this work.
† Corresponding author: francesco.cagnetta@epfl.ch
‡ Corresponding author: matthieu.wyart@epfl.ch

Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

(Received 19 December 2023; revised 15 April 2024; accepted 31 May 2024; published 1 July 2024)

Deep learning algorithms demonstrate a surprising ability to learn high-dimensional tasks from limited
examples. This is commonly attributed to the depth of neural networks, enabling them to build a hierarchy
of abstract, low-dimensional data representations. However, how many training examples are required to
learn such representations remains unknown. To quantitatively study this question, we introduce the
random hierarchy model: a family of synthetic tasks inspired by the hierarchical structure of language and
images. The model is a classification task where each class corresponds to a group of high-level features,
chosen among several equivalent groups associated with the same class. In turn, each feature corresponds to
a group of subfeatures chosen among several equivalent groups and so on, following a hierarchy of
composition rules. We find that deep networks learn the task by developing internal representations
invariant to exchanging equivalent groups. Moreover, the number of data required corresponds to the point
where correlations between low-level features and classes become detectable. Overall, our results indicate
how deep networks overcome the curse of dimensionality by building invariant representations and provide
an estimate of the number of data required to learn a hierarchical task.
DOI: 10.1103/PhysRevX.14.031001 Subject Areas: Complex Systems,
Interdisciplinary Physics,
Statistical Physics

I. INTRODUCTION

Deep learning methods exhibit superhuman performances in areas ranging from image recognition [1] to Go playing [2]. However, despite these accomplishments, we still lack a fundamental understanding of their working principles. Indeed, Go configurations and images lie in high-dimensional spaces, which are hard to sample due to the curse of dimensionality: The distance δ between neighboring data points decreases very slowly with their number P, as δ = O(P^{-1/d}), where d is the space dimension. Solving a generic task such as regression of a continuous function [3] requires a small δ, implying that P must be exponential in the dimension d. Such a number of data is unrealistically large: For example, the benchmark dataset ImageNet [4], whose effective dimension is estimated to be ≈50 [5], consists of only ≈10^7 data, significantly smaller than e^50 ≈ 10^20. This immense difference implies that learnable tasks are not generic, but highly structured. What is then the nature of this structure, and why are deep learning methods able to exploit it?

A popular idea attributes the efficacy of these methods to their ability to build a useful representation of the data, which becomes increasingly complex across the layers [6]. Interestingly, a similar increase in complexity is also found in the visual cortex of the primate brain [7,8]. In simple terms, neurons closer to the input learn to detect simple features like edges in a picture, whereas those deeper in the network learn to recognize more abstract features, such as faces [9,10]. Intuitively, if these representations are also invariant to aspects of the data unrelated to the task, such as the exact position of an object in a frame for image classification [11], they may effectively reduce the dimensionality of the problem and make it tractable. This view is supported by several empirical studies of the hidden representations of trained networks. In particular, measures such as the mutual information between such representations and the input [12,13], their intrinsic dimensionality [14,15], and their sensitivity toward transformations that do not affect the task (e.g., smooth deformations for image classification [16,17]) all eventually decay with the layer depth. However, none of these studies addresses the sample complexity, i.e., the number of training data necessary for learning such representations, and thus the task.




In this paper, we study the relationship between sample complexity, depth of the learning method, and structure of the data by focusing on tasks with a hierarchically compositional structure, arguably a key property for the learnability of real data [18–25]. To provide a concrete example, consider a picture that consists of several high-level features like face, body, and background. Each feature is composed of subfeatures like ears, mouth, eyes, and nose for the face, which can be further thought of as combinations of low-level features such as edges [26]. Recent studies have revealed that deep networks can represent hierarchically compositional functions with far fewer parameters than shallow networks [21], implying an information-theoretic lower bound on the sample complexity which is only polynomial in the input dimension [24]. While these works offer important insights, they do not characterize the performance of deep neural networks trained with gradient descent (GD).

We investigate this question by adopting the physicist's approach [27–32] of introducing a model of synthetic data, which is inspired by the structure of natural problems, yet simple enough to be investigated systematically. This model (Sec. II) belongs to a family of hierarchical classification problems where the class labels generate the input data via a hierarchy of composition rules. These problems were introduced to highlight the importance of input-to-label correlations for learnability [19] and were found to be learnable via an iterative clustering algorithm [22]. Under the assumption of randomness of the composition rules, we show empirically that shallow networks suffer from the curse of dimensionality (Sec. III), whereas the sample complexity P* of deep networks (both convolutional networks and multilayer perceptrons) is only polynomial in the size of the input. More specifically, with n_c classes and L composition rules that associate m equivalent low-level representations to each class or high-level feature, P* ≃ n_c m^L asymptotically in m (Sec. III).

Furthermore, we find that P* coincides with both (a) the number of data that allows for learning a representation that is invariant to exchanging the m semantically equivalent low-level features (Sec. III A) and (b) the size of the training set for which the correlations between low-level features and class label become detectable (Sec. IV). We prove for a simplified architecture trained with gradient descent that (a) and (b) must indeed coincide. Via (b), P* can be derived analytically under our assumption of randomness of the composition rules.

A. Relationship to other models of data structure

Characterizing the properties that make high-dimensional data learnable is a classical problem in statistics. Typical assumptions that allow for avoiding the curse of dimensionality include (i) data lying on a low-dimensional manifold and (ii) the task being smooth [33]. For instance, in the context of regression, the sample complexity is not controlled by the bare input dimensionality d, but by the ratio d_M/s [34–36], where d_M is the dimension of the data manifold and s the number of bounded derivatives of the target function. However, d_M is also large in practice [5]; thus, keeping d_M/s low requires an unrealistically large number of bounded derivatives. Moreover, properties (i) and (ii) can already be leveraged by isotropic kernel methods, and thus cannot account for the significant advantage of deep learning methods in many benchmark datasets [37]. Alternatively, learnability can be achieved when (iii) the task depends on a small number of linear projections of the input variables, such as regression of a target function f*(x) = g(x_t), where x ∈ R^d and x_t ∈ R^t [38–41]. Methods capable of learning features from the data can leverage this property to achieve a sample complexity that depends on t instead of d [42]. However, one-hidden-layer networks are sufficient for that; hence, this property does not explain the need for deep architectures.

In the context of statistical physics, the quest for a model of data structure has been pursued within the framework of teacher-student models [43–45], where a teacher uses some ground truth knowledge to generate data, while a student tries to infer the ground truth from the data. The structural properties (i)–(iii) can be incorporated into this approach [28,46]. In addition, using a shallow convolutional network as a teacher allows for modeling (iv) the locality of imagelike datasets [31,47,48]. In the context of regression, this property can be modeled with a function f*(x) = Σ_i f_i(x_i), where the sum is on all patches x_i of t adjacent pixels. Convolutional networks learn local tasks with a sample complexity controlled by the patch dimension t [47], even in the "lazy" regime [49,50] where they do not learn features. However, locality does not allow for long-range nonlinear dependencies in the task. It might be tempting to include these dependencies by considering a deep convolutional teacher network, but then the sample complexity would be exponential in the input dimension d [25].

The present analysis based on hierarchical generative models shows that properties (i)–(iii) are not necessary to beat the curse of dimensionality. Indeed, for some choices of the parameters, the model generates all possible d-dimensional sequences of input features, which violates (i). Additionally, changing a single input feature has a finite probability of changing the label, violating the smoothness assumption (ii). Finally, the label depends on all of the d input variables, violating (iii). Yet, we find that the sample complexity of deep neural networks is only polynomial in d. Since locality is incorporated hierarchically in the generative process, it generates long-range dependencies in the task, but it can still be leveraged by building a hierarchical representation of the data.


FIG. 1. The random hierarchy model. Left: structure of the generative model. The class label α = 1, …, n_c generates a set of m equivalent (i.e., synonymic) high-level representations with elements taken from a vocabulary of high-level features V_L. Similarly, high-level features generate m equivalent lower-level representations, taken from a vocabulary V_{L−1}. Repeating this procedure L − 2 times yields all the input data with label α, consisting of low-level features taken from V_1. Right: example of random hierarchy model with n_c = 2 classes, L = 3, s = 2, m = 3, and homogeneous vocabulary size v_1 = v_2 = v_3 = 3. The three sets of rules are listed at the top, while two examples of data generation are shown at the bottom. The first example is obtained by following the rules in the colored boxes.

II. RANDOM HIERARCHY MODEL

In this section, we introduce our generative model, which can be thought of as an L-level context-free grammar, a generative model of language from formal language theory [51]. The model consists of a set of class labels C ≡ {1, …, n_c} and L disjoint vocabularies V_l ≡ {a^{(l)}_1, …, a^{(l)}_{v_l}} of low- and high-level features. As illustrated in Fig. 1, left-hand panel, data are generated from the class labels. Specifically, each label generates m distinct high-level representations via m composition rules of the form

α ↦ μ_1^{(L)}, …, μ_s^{(L)}   for α ∈ C and μ_i^{(L)} ∈ V_L,   (1)

having size s > 1. The s elements of these representations are high-level features μ_i^{(L)} such as background, face, and body for a picture. Each high-level feature generates in turn m lower-level representations via other m rules,

μ^{(l)} ↦ μ_1^{(l−1)}, …, μ_s^{(l−1)}   for μ^{(l)} ∈ V_l, μ_i^{(l−1)} ∈ V_{l−1},   (2)

from l = L down to l = 1. The input features μ^{(1)} represent low-level features such as the edges in an image. Because of the hierarchical structure of the generative process, each datum can be represented as a tree of branching factor s and depth L, where the root is the class label, the leaves are the input features, and the hidden nodes are the level-l features with l = 2, …, L.
In addition, for each level l, there are m distinct rules emanating from the same higher-level feature μ^{(l)}; i.e., there are m equivalent lower-level representations of μ^{(l)} (see Fig. 1, right-hand panel, for an example with m = 3).
Following the analogy with language, we refer to these equivalent representations as synonyms. We assume that, for a single low-level representation, there is only one high-level feature that generates it, i.e., that there are no ambiguities. Since the number of distinct s-tuples at level l is bounded by v_l^s, this assumption requires m v_{l+1} ≤ v_l^s for all l = 1, …, L (with v_{L+1} ≡ n_c). If m = 1, each label generates only a single datum and the model is trivial. For m > 1, the number of data per class grows exponentially with the input dimension d = s^L:

m × m^s × ⋯ × m^{s^{L−1}} = m^{Σ_{i=0}^{L−1} s^i} = m^{(d−1)/(s−1)}.   (3)

In particular, in the case where m v_{l+1} = v_l^s, the model generates all the possible data made of d features in V_1. Instead, for m v_{l+1} < v_l^s, the set of available input data is given by the application of the composition rules; therefore, it inherits the hierarchical structure of the model.
Let us remark that, due to the nonambiguity assumption, each set of composition rules can be summarized with a function g_l that associates s-tuples of level-l features to the corresponding level-(l + 1) feature. The domain of g_l is a subset of V_l^s consisting of the m v_{l+1} s-tuples generated by the features at level (l + 1). Using these functions, the label α ≡ μ^{(L+1)} of an input datum μ^{(1)} = (μ_1^{(1)}, …, μ_d^{(1)}) can be written as a hierarchical composition of L local functions of s variables [20,21]:

μ_i^{(l+1)} = g_l( μ_{(i−1)s+1}^{(l)}, …, μ_{is}^{(l)} ),   (4)

for i = 1, …, s^{L−l} and l = 1, …, L.
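To make the generative process concrete, the following is a minimal NumPy sketch of how an instance of the RHM could be sampled. It is ours, not the released implementation [66]; the function names (sample_rules, sample_datum) are illustrative, the vocabulary size v is assumed homogeneous across levels, and the nonambiguity assumption is enforced by drawing the tuples of each level's rules without replacement.

```python
import numpy as np

def sample_rules(v, n_classes, m, s, L, rng):
    """Sample the L sets of composition rules of one RHM instance (a sketch)."""
    rules = []
    for level in range(L):
        n_high = n_classes if level == L - 1 else v  # the top rule emanates from the labels
        # all v**s possible s-tuples of lower-level features
        all_tuples = np.array(np.meshgrid(*[np.arange(v)] * s)).reshape(s, -1).T
        # draw without replacement so that no tuple is shared by two features (nonambiguity)
        chosen = rng.choice(len(all_tuples), size=m * n_high, replace=False)
        # rules[level][feature] is an (m, s) array of synonymous lower-level tuples
        rules.append(all_tuples[chosen].reshape(n_high, m, s))
    return rules

def sample_datum(label, rules, rng):
    """Descend from the root (class label) to the leaves (level-1 features)."""
    symbols = np.array([label])
    for level in reversed(range(len(rules))):
        # each symbol picks one of its m synonymous lower-level representations
        picks = rng.integers(0, rules[level].shape[1], size=len(symbols))
        symbols = rules[level][symbols, picks].reshape(-1)
    return symbols  # length s**L sequence of level-1 features

rng = np.random.default_rng(0)
rules = sample_rules(v=3, n_classes=2, m=3, s=2, L=3, rng=rng)
x = sample_datum(label=0, rules=rules, rng=rng)
print(x)  # e.g. a length-8 array of features in {0, 1, 2}
```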


Note that, while we keep s and m constant throughout the levels for ease of exposition, our results can be generalized without additional effort. Likewise, we will set the vocabulary size to v for all levels. To sum up, a single classification task is specified by the parameters n_c, v, m, and s and by the L composition rules. In the random hierarchy model (RHM) the composition rules are chosen uniformly at random over all the possible assignments of m representations of s low-level features to each of the v high-level features. An example of binary classification task (n_c = 2), with s = 2, L = 3, and v = m = 3, is shown in Fig. 1, right-hand panel, together with two examples of label-input pairs. Notice that the random choice induces correlations between low- and high-level features. In simple terms, each of the high-level features, e.g., the level-2 features d, e, or f in the figure, is more likely to be represented with a certain low-level feature in a given position, e.g., i on the right for d, g on the right for e, and h on the right for f. These correlations are crucial for our predictions and are analyzed in detail in Appendix C.

FIG. 2. Sample complexity of two-layer fully connected networks, for L = s = 2 and v = n_c = m. Top: test error versus the number of training data. Different colors correspond to different vocabulary sizes v. Bottom: number of training data resulting in test error ϵ̄ = 0.7 as a function of P_max, with the black dashed line indicating a linear relationship.

III. SAMPLE COMPLEXITY OF DEEP NEURAL NETWORKS

The main focus of our work is the answer to the following question.
How much data is required to learn a typical instance of the random hierarchy model with a deep neural network?
Thus, after generating an instance of the RHM with fixed parameters n_c, s, m, v, and L, we train neural networks of varying depth with stochastic gradient descent (SGD) on a set of P training points. The training points are sampled uniformly at random without replacement from the set of available RHM data; hence, they are all distinct. We adopt a one-hot encoding of the input features, so that each input point x is a (d × v)-dimensional sequence where, for i = 1, …, d and ν ∈ V_1,

x_{i,ν} = 1 if μ_i^{(1)} = ν, and 0 otherwise.   (5)

All our experiments consider overparametrized networks, which we achieve in practice by choosing the width H of the network's hidden layers such that (i) the training loss reaches 0 and (ii) the test accuracy does not improve by increasing H. To guarantee representation learning as H grows, we consider the maximal update parametrization [52], equivalent to having the standard H^{-1/2} scaling of the hidden layer weights plus an extra factor of H^{-1/2} at the last layer. Further details of the machine learning methods can be found in Appendix A.
(a) Shallow networks are cursed. Let us begin with the sample complexity of two-layer fully connected networks. As shown in Fig. 2, in the maximal case n_c = v, m = v^{s−1} these networks learn the task only if trained on a significant fraction of the total number of data P_max. From Eq. (3),

P_max = n_c m^{(d−1)/(s−1)},   (6)

which equals v^{s^L} in the maximal case. The bottom panel of Fig. 2, in particular, highlights that the number of training data required for having a test error ϵ ≤ 0.7 ϵ_rand, with ϵ_rand = 1 − n_c^{-1} denoting the error of a random guess of the label, is proportional to P_max. Since P_max is exponential in d, this is an instance of the curse of dimensionality.
(b) Deep networks break the curse. For networks having a depth larger than that of the RHM L, the test error displays a sigmoidal behavior as a function of the training set size. This finding is illustrated in the top panels of Figs. 3 and 4 (and Fig. 12 of Appendix F for varying n_c) for convolutional neural networks (CNNs) of depth L + 1 (details in Appendix A). Similar results are obtained for multilayer perceptrons of depth > L, as shown in Appendix F. All these results suggest the existence of a well-defined number of training data at which the task is learned. Mathematically, we define the sample complexity P* as the smallest training set size P such that the test error ϵ(P) is smaller than ϵ_rand/10. The bottom panels of Figs. 3 and 4 (and Figs. 12 and 13) show that

P* ≃ n_c m^L ⇔ P*/n_c ≃ d^{ln(m)/ln(s)},   (7)


independently of the vocabulary size v. Since P* is a power of the input dimension d = s^L, the curse of dimensionality is beaten, which evidences the ability of deep networks to harness the hierarchical compositionality of the task. It is crucial to note, however, that this ability manifests only in feature learning regimes, e.g., under the maximal update parametrization considered in this work. Conversely, as shown in Fig. 14 of Appendix F for the maximal case n_c = v, m = v^{s−1}, deep networks trained in the "lazy" regime [49], where they do not learn features, suffer from the curse of dimensionality, even when their architecture is matched to the structure of the RHM.

FIG. 3. Sample complexity of depth-(L + 1) CNNs, for s = 2 and m = n_c = v. Top: test error versus number of training points. Different colors correspond to different vocabulary sizes v while the markers indicate the hierarchy depth L. Bottom: sample complexity P* corresponding to a test error ϵ = 0.1 ϵ_rand. The empirical points show remarkable agreement with the law P* = n_c m^L, shown as a black dashed line.

We now turn to study the internal representations of trained networks and the mechanism that they employ to solve the task.

FIG. 4. Sample complexity of depth-(L + 1) CNNs, for s = 2, n_c = v and varying m ≤ v. Top: test error versus number of training points, with different colors corresponding to different vocabulary sizes v and markers indicating the hierarchy depth L. Bottom: sample complexity P*, with the law P* = n_c m^L shown as a black dashed line.

A. Emergence of synonymic invariance in deep CNNs

A natural approach to learning the RHM would be to identify the sets of s-tuples of input features that correspond to the same higher-level feature, i.e., synonyms. Identifying synonyms at the first level would allow for replacing each s-dimensional patch of the input with a single symbol, reducing the dimensionality of the problem from s^L to s^{L−1}. Repeating this procedure L times would lead to the class labels and, consequently, to the solution of the task.
To test if deep networks trained on the RHM resort to a similar solution, we introduce the synonymic sensitivity, which is a measure of the invariance of a function with respect to the exchange of synonymic low-level features. Mathematically, we define S_{k,l} as the sensitivity of the kth layer representation of a deep network with respect to exchanges of synonymous s-tuples of level-l features. Namely,

S_{k,l} = ⟨‖f_k(x) − f_k(P_l x)‖²⟩_{x,P_l} / ⟨‖f_k(x) − f_k(y)‖²⟩_{x,y},   (8)

where f_k is the sequence of activations of the kth layer in the network, P_l is an operator that replaces all the level-l tuples with one of their m − 1 synonyms chosen uniformly at random, ⟨·⟩ with subscripts x, y denotes an average over pairs of input data of an instance of the RHM, and the subscript P_l denotes an average over all the exchanges of synonyms.
Figure 5 reports S_{2,1}, which measures the sensitivity to exchanges of synonymic tuples of input features, as a function of the training set size P for deep CNNs trained on RHMs with different parameters. We focused on S_{2,1}, the sensitivity of the second layer of the network, since a single linear transformation of the input cannot produce an invariant representation in general [53]. Note that all the curves display a sigmoidal shape, signaling the existence of a characteristic sample size which marks the emergence of synonymic invariance in the learned representations. Remarkably, by rescaling the x axis by the sample complexity of Eq. (7) (bottom panel of Fig. 5), curves corresponding to different parameters collapse. We conclude that the generalization ability of a network relies on the synonymic invariance of its hidden representations.
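As a concrete illustration of Eq. (8), the following sketch (ours, not the released code [68]) estimates S_{k,1} for a PyTorch model. It assumes access to a callable returning the kth-layer activations and to a batch x_swapped in which every level-1 s-tuple has already been replaced by one of its synonyms using the known composition rules (e.g., via a hypothetical helper swap_level1_synonyms).

```python
import torch

def synonymic_sensitivity(layer_acts, x, x_swapped):
    """Estimate S_{k,1} of Eq. (8) for one layer.

    layer_acts: callable returning the k-th layer activations, shape (batch, ...)
    x:          batch of RHM inputs, one-hot encoded
    x_swapped:  same batch with every level-1 s-tuple replaced by a synonym
                (produced from the known rules, e.g. swap_level1_synonyms(x, rules),
                 a hypothetical helper)
    """
    with torch.no_grad():
        f_x = layer_acts(x).flatten(1)
        f_swap = layer_acts(x_swapped).flatten(1)
        # numerator: mean squared change of the representation under synonym exchanges
        num = (f_x - f_swap).pow(2).sum(dim=1).mean()
        # denominator: mean squared distance between unrelated inputs,
        # estimated by pairing each datum with a random permutation of the batch
        f_y = f_x[torch.randperm(f_x.shape[0])]
        den = (f_x - f_y).pow(2).sum(dim=1).mean()
    return (num / den).item()
```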


FIG. 5. Synonymic sensitivity S_{2,1} for a depth-(L + 1) CNN trained on the RHM with s = 2, n_c = m = v as a function of the training set size (L and v as in the key). The collapse achieved after rescaling by P* = n_c m^L highlights that the sample complexity coincides with the number of training points required to build internal representations invariant to exchanging synonyms.

Measures of the synonymic sensitivity S_{k,1} for different layers k are reported in Fig. 6 (blue lines), showing indeed that the layers k ≥ 2 become insensitive to exchanging level-1 synonyms. Figure 6 also shows the sensitivities to exchanges of higher-level synonyms: All levels are learned together as P increases, and invariance to level-l exchanges is achieved from layer k = l + 1. The test error is also shown (gray dashed lines) to further emphasize its correlation with synonymic invariance.
(a) Synonymic invariance and effective dimension. Note that the collapse of the representations of synonymic tuples to the same value implies a progressive reduction of the effective dimensionality of the hidden representations, as reported in Fig. 11 of Appendix E.

FIG. 6. Synonymic sensitivities S_{k,l} of the layers of a depth-(L + 1) CNN trained on an RHM with L = 3, s = 2, n_c = m = v = 8, as a function of the training set size P. The colors denote the level of the exchanged synonyms (as in the key), whereas different panels correspond to the sensitivity of the activations of different layers (layer index in the gray box). Synonymic invariance is learned at the same training set size for all layers, and invariance to level-l exchanges is obtained from layer k = l + 1.

IV. CORRELATIONS GOVERN SYNONYMIC INVARIANCE

We now provide a theoretical argument for understanding the scaling of P* of Eq. (7) with the parameters of the RHM. First, we compute a third characteristic sample size P_c, defined as the size of the training set for which the local correlations between any of the input patches and the label become detectable. Remarkably, P_c coincides with P* of Eq. (7). Second, we demonstrate how a shallow (two-layer) neural network acting on a single patch can use such correlations to build a synonymic invariant representation in a single step of gradient descent, so that P_c and P* also correspond to the emergence of an invariant representation. Last, we show empirically that removing such correlations leads again to the curse of dimensionality, even if the network architecture is matched to the structure of the RHM.

A. Identify synonyms by counting

Groups of input patches forming synonyms can be inferred by counting, at any given location, the occurrences of such patches in all the data corresponding to a given class α. Indeed, tuples of features that appear with identical frequencies are likely synonyms. More specifically, let us denote by x_j an s-dimensional input patch for j in 1, …, s^{L−1}, by μ = (μ_1, …, μ_s) an s-tuple of input features, and by N_j(μ, α) the number of data in class α having x_j = μ [54]. Normalizing this number by N_j(μ) = Σ_α N_j(μ, α) yields the conditional probability f_j(α|μ) for a datum to belong to class α conditioned on displaying the s-tuple μ in the jth input patch:

f_j(α|μ) ≔ Pr{x ∈ α | x_j = μ} = N_j(μ, α)/N_j(μ).   (9)

If the low-level features are homogeneously spread across classes, then f = n_c^{-1} for all α, μ, and j. In contrast, due to the aforementioned correlations, the probabilities of the RHM are all different from n_c^{-1}; we refer to this difference as signal. Distinct level-1 tuples μ and ν yield a different f (and thus a different signal) with high probability unless μ and ν are synonyms, i.e., they share the same level-2 representation. Therefore, this signal can be used to identify synonymous level-1 tuples.
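The counting procedure described above is straightforward to implement. The sketch below (ours) builds the empirical conditional class frequencies f̂_j(α|μ) of Eq. (9) from a finite training set, assuming the inputs are given as integer feature sequences rather than one-hot encodings.

```python
import numpy as np
from collections import defaultdict

def empirical_class_frequencies(X, y, s, n_classes, j=0):
    """Estimate \\hat f_j(alpha | mu) of Eq. (9) for patch j from P training points.

    X: array of shape (P, d) of level-1 features; y: array of P class labels.
    """
    counts = defaultdict(lambda: np.zeros(n_classes))
    for features, label in zip(X, y):
        patch = tuple(features[j * s:(j + 1) * s])  # the s-tuple appearing in patch j
        counts[patch][label] += 1
    # normalize each tuple's counts to obtain the empirical frequencies
    return {patch: c / c.sum() for patch, c in counts.items()}

# Synonymous tuples should give (nearly) identical frequency vectors once the
# sampling noise, of order [P / (n_c m v)]^(-1/2), drops below the signal, i.e. P >~ P_c.
```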


B. Signal versus sampling noise

When measuring the conditional class probabilities with only P training data, the occurrences in the right-hand side of Eq. (9) are replaced with empirical occurrences, which induce a sampling noise on the f's. For the identification of synonyms to be possible, this noise must be smaller in magnitude than the aforementioned signal; a visual representation of the comparison between signal and noise is depicted in Fig. 7.

FIG. 7. Signal versus noise illustration. The dashed function represents the distribution of f(α|μ) resulting from the random sampling of the RHM rules. The solid dots illustrate the true frequencies f(α|μ) sampled from this distribution, with different colors corresponding to different groups of synonyms. The typical spacing between the solid dots, given by the width of the distribution, represents the signal. Transparent dots represent the empirical frequencies f̂_j(α|μ), with dots of the same color corresponding to synonymous features. The spread of transparent dots of the same color, which is due to the finiteness of the training set, represents the noise.

The magnitude of the signal can be computed as the ratio between the standard deviation and mean of f_j(α|μ) over realizations of the RHM. The full calculation is presented in Appendix C; here we present a simplified argument based on an additional independence assumption. Given a class α, the tuple μ appearing in the jth input patch is determined by a sequence of L choices, one choice per level of the hierarchy, of one among m possible lower-level representations. These m^L possibilities lead to all the mv distinct input s-tuples. N_j(μ, α) is proportional to how often the tuple μ is chosen, m^L/(mv) times on average. Under the assumption of independence of the m^L choices, the fluctuations of N_j(μ, α) relative to its mean are given by the central limit theorem and read [m^L/(mv)]^{-1/2} in the limit of large m. If n_c is sufficiently large, the fluctuations of N_j(μ) are negligible in comparison. Therefore, the relative fluctuations of f_j are the same as those of N_j(μ, α), and the size of the signal is [m^L/(mv)]^{-1/2}.
The magnitude of the noise is given by the ratio between the standard deviation and mean, over independent samplings of a training set of fixed size P, of the empirical conditional probabilities f̂_j(α|μ). Only P/(n_c m v) of the training points will, on average, belong to class α while displaying feature μ in the jth patch. Therefore, by the convergence of the empirical measure to the true probability, the sampling fluctuations of f̂ relative to the mean are of order [P/(n_c m v)]^{-1/2}; see Appendix C for a detailed derivation. Balancing signal and noise yields the characteristic P_c for the emergence of correlations. For large m, n_c, and P,

P_c = n_c m^L,   (10)

which coincides with the empirical sample complexity of deep networks discussed in Sec. III.

C. Learning level-1 synonyms with one step of gradient descent

To complete the argument, we consider a simplified one-step gradient descent setting [55,56], where P_c marks the number of training examples required to learn a synonymic invariant representation. In particular, we focus on the s-dimensional patches of the data and study how a two-layer network acting on one of such patches learns the first composition rule of the RHM by building a representation invariant to exchanges of level-1 synonyms.
Let us then sample an instance of the RHM and P input-label pairs (x_{k,1}, α_k) with α_k ≔ α(x_k) for all k = 1, …, P and x_{k,1} denoting the first s-patch of the datum x_k. The network output reads

F_NN(x_1) = (1/H) Σ_{h=1}^{H} a_h σ(w_h · x_1),   (11)

where the inner-layer weights w_h have the same dimension as x_1, the top-layer weights a_h are n_c dimensional, and σ(x) = max(0, x) is the rectified linear unit (ReLU) activation function. To further simplify the problem, we represent x_1 as a v^s-dimensional one-hot encoding of the corresponding s-tuple of features. This representation is equivalent to an orthogonalization of the input points. In addition, the top-layer weights are initialized as i.i.d. Gaussian with zero mean and unit variance and fixed, whereas the w_h's are initialized with all their elements set to 1 and trained by gradient descent on the empirical cross-entropy loss:

L = (1/P) Σ_{k=1}^{P} [ −log( e^{[F_NN(x_{k,1})]_{α(x_k)}} / Σ_{β=1}^{n_c} e^{[F_NN(x_{k,1})]_β} ) ].   (12)

Finally, we consider the mean-field limit H → ∞, so that, at initialization, F_NN^{(0)} = 0 identically.
Let us denote with μ(x_1) the s-tuple of features encoded in x_1. Because of the one-hot encoding, f_h(x_1) ≔ w_h · x_1 coincides with the μ(x_1)th component of the weight w_h.


This component, which is set to 1 at initialization, is updated by (minus) the corresponding component of the gradient of the loss in Eq. (12). Recalling also that the predictor is 0 at initialization, we get

Δf_h(x_1) = −∇_{(w_h)_{μ(x_1)}} L
= (1/P) Σ_{k=1}^{P} Σ_{α=1}^{n_c} a_{h,α} δ_{μ(x_1),μ(x_{k,1})} ( δ_{α,α(x_k)} − 1/n_c )
= Σ_{α=1}^{n_c} a_{h,α} [ N̂_1(μ(x_1), α)/P − (1/n_c) N̂_1(μ)/P ],   (13)

where N̂_1(μ) is the empirical occurrence of the s-tuple μ in the first patch of the P training points and N̂_1(μ, α) is the (empirical) joint occurrence of the s-tuple μ and the class label α. As P increases, the empirical occurrences N̂ converge to the true occurrences N, which are invariant for the exchange of synonym s-tuples μ. Hence, the hidden representation is also invariant for the exchange of synonym s-tuples in this limit.
This prediction is confirmed empirically in Fig. 8, which shows the sensitivity S_{1,1} of the hidden representation [57] of shallow fully connected networks trained in the setting of this section, as a function of the number P of training data for different combinations of the model parameters. The bottom panel, in particular, highlights that the sensitivity is close to 1 for P ≪ P_c and close to 0 for P ≫ P_c. In addition, notice that the collapse of the preactivations of synonymic tuples onto the same synonymic-invariant value implies that the rank of the hidden weights matrix tends to v, the vocabulary size of higher-level features. This low-rank structure is typical in the weights of deep networks trained on image classification [58–61].

FIG. 8. Synonymic sensitivity of the hidden representation versus P for a two-layer fully connected network trained on the first patch of the inputs of an RHM with s = 2 and m = v, for varying L, v, and n_c. The top panel shows the bare curves whereas, in the bottom panel, the x axis is rescaled by P_c = n_c m^L. The collapse of the rescaled curves highlights that P_c coincides with the number of training data for building a synonymic invariant representation.

(a) Including all patches via weight sharing. Let us remark that one can easily extend the one-step setting to include the information from all the input patches, for instance, by replacing the network in Eq. (11) with a one-hidden-layer convolutional network with filter size s and nonoverlapping patches. Consequently, the empirical occurrences on the right-hand side of Eq. (13) would be replaced with average occurrences over the patches. However, this average results in a reduction of both the signal and the sampling noise contributions to the empirical occurrences by the same factor √(s^{L−1}). Therefore, weight sharing does not affect the sample size required for synonymic invariance in the one-step setting.
(b) Improved sample complexity via clustering. A distance-based clustering method acting on the representations of Eq. (13) can actually identify synonyms at P ≃ √(n_c) m^L = P_c/√(n_c), which is much smaller than P_c in the large-n_c limit. Intuitively, using a sequence instead of a scalar amplifies the signal by a factor n_c and the sampling noise by a factor √(n_c), improving the signal-to-noise ratio. We show that this is indeed the case in Appendix D for the maximal dataset case n_c = v and m = v^{s−1}. Previous theoretical studies have considered the possibility of intercalating clustering steps in standard gradient descent methods [22,62], but the question of whether deep learning methods can achieve a similar sample complexity with standard end-to-end training remains open.


D. Curse of dimensionality without correlations

To support the argument that learning is possible because of the detection of local input-label correlations, we show that their removal in the RHM leads to a sample complexity exponential in d, even for deep networks. Removing such correlations implies that, at any level, features are uniformly distributed among classes. This is achieved by enforcing that a tuple μ in the jth patch at level l belongs to a class α with probability n_c^{-1}, independently of μ, j, l, and α, as discussed in Sec. IV A. Such a procedure produces an uncorrelated version of the RHM, which generalizes the parity problem (realized for m = v = n_c = 2), a task that cannot be learned efficiently with gradient-based methods [63]. Indeed, deep CNNs with depth L + 1, trained on this uncorrelated RHM, are cursed by dimensionality, as shown in Fig. 9. The CNN test error is close to ϵ_rand, given by randomly guessing the label, even for P/P_max > 0.9, particularly for v > 2.

FIG. 9. Test error of depth-(L + 1) CNNs trained on uncorrelated RHM versus number P of training points rescaled with P_max, with s = 2 and m = n_c = v with different v (different colors), for L = 2 (top) and L = 3 (bottom). Horizontal dashed lines stand for ϵ_rand, given by guessing the label uniformly at random.

V. CONCLUSION

What makes real-world tasks learnable? This question extends from machine learning to brain science [64]. To start thinking quantitatively about it, we introduced the random hierarchy model: a family of tasks that captures the compositional structure of natural data. We showed that neural networks can learn such tasks with a limited training set, by developing a hierarchical representation of the data. Overall, these results rationalize several phenomena associated with deep learning.
First, our finding that for hierarchical tasks the sample complexity is polynomial in the input dimension (and not exponential) leads to a plausible explanation for the learnability of real-world tasks. Moreover, our results provide a rule of thumb for estimating the order of magnitude of the sample complexity of benchmark datasets. In the case of CIFAR10 [65], for instance, having 10 classes, taking reasonable values for task parameters such as m ∈ [5, 15] and L = 3 yields P* ∈ [10^3, 3 × 10^4], comparable with the sample complexity of modern architectures (see Fig. 15).
Second, our results quantify the intuition that depth is crucial to building a hierarchical representation that effectively lowers the dimension of the problem, and allows for avoiding the curse of dimensionality. On the one hand, this result gives a foundation to the claim that deep is better than shallow, beyond previous analyses that focused on expressivity [21,24] rather than learning. On the other hand, our result that the internal representations of trained networks mirror the hierarchical structure of the task explains why these representations become increasingly complex with depth in real-world applications [9,10].
Furthermore, we provided a characterization of the internal representations based on their sensitivity toward transformations of the low-level features that leave the class label unchanged. This viewpoint complements existing ones that focus instead on the input features that maximize the response of hidden neurons, thus enhancing the interpretability of neural nets. In addition, our approach bypasses several issues of previous characterizations. For example, approaches based on mutual information [12] are ill defined when the network representations are deterministic functions of the input [13], whereas those based on intrinsic dimension [14,15] can display counterintuitive results; see Appendix E for a deeper discussion of the intrinsic dimension and on how it behaves in our framework.
Finally, our study predicts a fundamental relationship between sample complexity, correlations between low-level features and labels, and the emergence of invariant representations. This prediction can be tested beyond the context of our model, for instance, by studying invariance to exchanging synonyms in language modeling tasks.
Looking forward, the random hierarchy model is a suitable candidate for the clarification of other open questions in the theory of deep learning. For instance, a formidable challenge is to obtain a detailed description of the gradient descent dynamics of deep networks. Indeed, dynamics may be significantly easier to analyze in this model, since quantities characterizing the network success, such as sensitivity to synonyms, can be delineated. In addition, the model could be generalized to describe additional properties of data, e.g., noise in the form of errors in the composition rules or inhomogeneities in the frequencies at which high-level features generate low-level representations. The latter, in particular, would generate data where certain input features are more abundant than others and, possibly, lead to a richer learning scenario with several characteristic training set sizes.
Beyond supervised learning, in the random hierarchy model the set of available input data inherits the hierarchical structure of the generative process. Thus, this model offers a new way to study the effect of compositionality on self-supervised learning or probabilistic generative models, extremely powerful techniques whose understanding is still in its infancy.

The codes that support the findings of this paper are openly available [66–68].


ACKNOWLEDGMENTS

The authors thank Antonio Sclocchi for fruitful discussions and helpful feedback on the manuscript. This work was supported by a grant from the Simons Foundation (No. 454953 for M. W.).

APPENDIX A: METHODS

1. RHM implementation

The code implementing the RHM is available online [66]. The inputs sampled from the RHM are represented as a one-hot encoding of low-level features so that each input consists of s^L pixels and v channels (size s^L × v). The input pixels are whitened over channels; i.e., each pixel has zero mean and unit variance over the channels.

2. Machine learning models

We consider both generic deep neural networks and deep convolutional networks tailored to the structure of the RHM. Generic deep neural networks are made by stacking fully connected layers, i.e., linear transformations of the kind

x ∈ R^{d_in} → d_in^{-1/2} W · x + b ∈ R^{d_out},   (A1)

where W is a d_out × d_in matrix of weights, b is a d_out-dimensional sequence of biases, and the factor d_in^{-1/2} guarantees that the outputs remain of order 1 when d_in is varied. Convolutional layers, instead, act on imagelike inputs that have a spatial dimension d and c_in channels and compute the convolution of the input with a filter of spatial size f. This operation is equivalent to applying the linear transformation of Eq. (A1) to input patches of spatial size f, i.e., groups of f adjacent pixels [dimension d_in = (f × c_in)]. The output has an imagelike structure analogous to that of the input, with spatial dimension depending on how many patches are considered. In the nonoverlapping patches case, for instance, the spatial dimension of the output is d/f.
For all layers but the last, the linear transformation is followed by an elementwise nonlinear activation function σ. We resort to the popular ReLU σ(x) = max(0, x). The output dimension is always fixed to the number of classes n_c, while the input dimension of the first layer is the same as that of the input data: spatial dimension s^L and v channels, flattened into a single s^L × v sequence when using a fully connected layer. The dimensionalities of the other hidden layers are set to the same constant H throughout the network. Following the maximal update parametrization [69], the weights of the last layer are multiplied by an additional factor H^{-1}. This factor causes the output at initialization to vanish as H grows, which induces representation learning even in the H → ∞ limit. In practice, we set H = (4–8) × v^s. Increasing this number further does not affect any of the results presented in the paper.
To tailor deep CNNs to the structure of the RHM, we set f = s so that, in the nonoverlapping patches setting, each convolutional filter acts on a group of s low-level features that correspond to the same higher-level feature. Since the spatial dimensionality of the input is s^L and each layer reduces it by s, the number of nonlinear layers in a tailored CNN is fixed to the depth of the RHM L, so that the network depth is L + 1. Fully connected networks, instead, can have any depth. The code for the implementation of both architectures is available online [67].

3. Training procedure

Training is performed within the PYTORCH deep learning framework [70]. Neural networks are trained on P training points sampled uniformly at random from the RHM data, using stochastic gradient descent on the cross-entropy loss. The batch size is 128 for P ≥ 128 and P otherwise, the learning rate is initialized to 10^{-1} and follows a cosine annealing schedule which reduces it to 10^{-2} over 100 epochs. Training stops when the training loss reaches 10^{-3}. The corresponding code is available online [68].
The performance of the trained models is measured as the classification error on a test set. The size of the test set is set to min(P_max − P, 20 000). Synonymic sensitivity, as defined in Eq. (8), is measured on a test set of size min(P_max − P, 1000). Reported results for a given value of RHM parameters are averaged over 10 jointly different instances of the RHM and network initialization.


APPENDIX B: STATISTICS OF THE COMPOSITION RULES

In this appendix, we consider a single composition rule, that is, the assignment of m s-tuples of low-level features to each of the v high-level features. In the RHM these rules are chosen uniformly at random over all the possible rules; thus their statistics are crucial in determining the correlations between the input features and the class label.

1. Statistics of a single rule

For each rule, we call N_i(μ_1, μ_2) the number of occurrences of the low-level feature μ_1 in position i of the s-tuples generated by the higher-level feature μ_2. The probability of N_i(μ_1, μ_2) is that of the number of successes when drawing m (number of s-tuples associated with the high-level feature μ_2) times without replacement from a pool of v^s (total number of s-tuples with vocabulary size v) objects where only v^{s−1} satisfy a certain condition (number of s-tuples displaying feature μ_1 in position i):

Pr{N_i(μ_1, μ_2) = k} = \binom{v^{s−1}}{k} \binom{v^s − v^{s−1}}{m − k} / \binom{v^s}{m},   (B1)

which is a hypergeometric distribution H_{v^s, v^{s−1}, m}, with mean

⟨N⟩ = m v^{s−1}/v^s = m/v,   (B2)

and variance Var[N_i(μ_1, μ_2)]

σ²_N ≔ ⟨(N − ⟨N⟩)²⟩ = m (v^{s−1}/v^s) ((v^s − v^{s−1})/v^s) ((v^s − m)/(v^s − 1)) = (m/v)((v − 1)/v)((v^s − m)/(v^s − 1)) → m/v for m ≫ 1,   (B3)

independently of the position i and the specific low- and high-level features. Note that, since m ≤ v^{s−1} with s fixed, large m implies also large v.

2. Joint statistics of a single rule

(a) Shared high-level feature. For a fixed high-level feature μ_2, the joint probability of the occurrences of two different low-level features μ_1 and ν_1 is a multivariate hypergeometric distribution,

Pr{N_i(μ_1, μ_2) = k, N_i(ν_1, μ_2) = l} = \binom{v^{s−1}}{k} \binom{v^{s−1}}{l} \binom{v^s − 2v^{s−1}}{m − k − l} / \binom{v^s}{m},   (B4)

giving the following covariance:

c_N ≔ ⟨(N_i(μ_1, μ_2) − ⟨N⟩)(N_i(ν_1, μ_2) − ⟨N⟩)⟩ = −(m/v)² (v^s − m)/(m(v^s − 1)) → −(1/m)(m/v)² for m ≫ 1.   (B5)

The covariance can also be obtained via the constraint Σ_{μ_1} N_i(μ_1, μ_2) = m. For any finite sequence of identically distributed random variables X_μ with a constraint on the sum Σ_μ X_μ = m:

Σ_{μ=1}^{v} X_μ = m ⇒ Σ_{μ=1}^{v} (X_μ − ⟨X_μ⟩) = 0
⇒ ⟨(X_ν − ⟨X_ν⟩) Σ_{μ=1}^{v} (X_μ − ⟨X_μ⟩)⟩ = 0
⇒ Σ_{μ=1}^{v} ⟨(X_ν − ⟨X_ν⟩)(X_μ − ⟨X_μ⟩)⟩ = 0
⇒ Var[X_μ] + (v − 1) Cov[X_μ, X_ν] = 0.   (B6)

In the last line, we used the identically distributed variables hypothesis to replace the sum over μ ≠ ν with the factor (v − 1). Therefore,

c_N = Cov[N_i(μ_1, μ_2), N_i(ν_1, μ_2)] = −σ²_N/(v − 1).   (B7)

(b) Shared low-level feature. The joint probability of the occurrences of the same low-level feature μ_1 starting from different high-level features μ_2 ≠ ν_2 can be written as follows,

Pr{N(μ_1, μ_2) = k, N(μ_1, ν_2) = l} = Pr{N(μ_1, μ_2) = k | N(μ_1, ν_2) = l} × Pr{N(μ_1, ν_2) = l} = H_{v^s − m, v^{s−1} − l, m}(k) × H_{v^s, v^{s−1}, m}(l),   (B8)

resulting in the following "interfeature" covariance:

c_if ≔ Cov[N_i(μ_1, μ_2), N_i(μ_1, ν_2)] = −(m/v)² (v − 1)/(v^s − 1).   (B9)

(c) No shared features. Finally, by multiplying both sides of Σ_{μ_1} N(μ_1, μ_2) = m with N(ν_1, ν_2) and averaging, we get

c_g ≔ Cov[N_i(μ_1, μ_2), N_i(ν_1, ν_2)] = −Cov[N_i(μ_1, μ_2), N_i(μ_1, ν_2)]/(v − 1) = (m/v)² (1/(v^s − 1)).   (B10)
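The single-rule moments above are easy to verify numerically. The following sketch (ours) samples many random rules and compares the empirical mean and variance of N_i(μ_1, μ_2) with Eqs. (B2) and (B3).

```python
import numpy as np

def rule_occurrence_moments(v, m, s, n_rules=20000, i=0, seed=0):
    """Monte Carlo estimate of the mean and variance of N_i(mu1, mu2) over random rules."""
    rng = np.random.default_rng(seed)
    all_tuples = np.array(np.meshgrid(*[np.arange(v)] * s)).reshape(s, -1).T
    counts = np.empty(n_rules)
    for r in range(n_rules):
        # the m s-tuples assigned to one high-level feature, drawn without replacement
        chosen = rng.choice(len(all_tuples), size=m, replace=False)
        # occurrences of the low-level feature mu1 = 0 in position i
        counts[r] = np.sum(all_tuples[chosen, i] == 0)
    return counts.mean(), counts.var()

mean, var = rule_occurrence_moments(v=4, m=4, s=2)
print(mean, 4 / 4)                                        # ~ <N> = m/v, Eq. (B2)
print(var, (4/4) * (3/4) * ((4**2 - 4) / (4**2 - 1)))     # ~ sigma_N^2, Eq. (B3)
```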


APPENDIX C: EMERGENCE OF INPUT-OUTPUT CORRELATIONS (P_c)

As discussed in the main text, the random hierarchy model presents a characteristic sample size P_c corresponding to the emergence of the input-output correlations. This sample size predicts the sample complexity of deep CNNs, as we also discuss in the main text. In this appendix, we prove that

P_c → n_c m^L   for n_c, m → ∞.   (C1)

1. Estimating the signal

The correlations between input features and the class label can be quantified via the conditional probability (over realizations of the RHM) of a data point belonging to class α conditioned on displaying the s-tuple μ in the jth input patch,

f_j(α|μ) ≔ Pr{x ∈ α | x_j = μ},   (C2)

where the notation x_j = μ means that the elements of the patch x_j encode the tuple of features μ. We say that the low-level features are correlated with the output if

f_j(α|μ) ≠ 1/n_c,   (C3)

and define a "signal" as the difference f_j(α|μ) − n_c^{-1}. In the following, we compute the statistics of the signal over realizations of the RHM.

a. Occurrence of low-level features

Let us begin by defining the joint occurrences of a class label α and a low-level feature μ_1 in a given position of the input. Using the tree representation of the model, we will identify an input position with a set of L indices i_l = 1, …, s, each indicating which branch to follow when descending from the root (class label) to a given leaf (low-level feature). These joint occurrences can be computed by combining the occurrences of the single rules introduced in Appendix B. With L = 2, for instance,

N^{(1→2)}_{i_1 i_2}(μ_1, α) = m^{s−1} Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × N^{(2)}_{i_2}(μ_2, α),   (C4)

where
(i) N^{(2)}_{i_2}(μ_2, α) counts the occurrences of μ_2 in position i_2 of the level-2 representations of α, i.e., the s-tuples generated from α according to the second-layer composition rule;
(ii) N^{(1)}_{i_1}(μ_1, μ_2) counts the occurrences of μ_1 in position i_1 of the level-1 representations of μ_2, i.e., the s-tuples generated by μ_2 according to the composition rule of the first layer;
(iii) the factor m^{s−1} counts the descendants of the remaining s − 1 elements of the level-2 representation (m descendants per element);
(iv) the sum over μ_2 counts all the possible paths of features that lead to μ_1 from α across 2 generations.
The generalization of Eq. (C4) is immediate once one takes into account that the multiplicity factor accounting for the descendants of the remaining positions at the lth generation is equal to m^{s^{l−1}}/m (s^{l−1} is the size of the representation at the previous level). Hence, the overall multiplicity factor after L generations is

1 × (m^s/m) × ⋯ × (m^{s^{L−1}}/m) = m^{(s^L − 1)/(s−1) − L},   (C5)

so that the number of occurrences of feature μ_1 in position i_1…i_L of the inputs belonging to class α is

N^{(1→L)}_{i_{1→L}}(μ_1, α) = m^{(s^L − 1)/(s−1) − L} Σ_{μ_2,…,μ_L=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, α),   (C6)

where we used i_{1→L} as a shorthand notation for the tuple of indices i_1, i_2, …, i_L.
The same construction allows us to compute the number of occurrences of up to s − 1 features within the s-dimensional patch of the input corresponding to the path i_{2→L}. The number of occurrences of a whole s-tuple, instead, follows a slightly different rule, since there is only one level-2 feature μ_2 which generates the whole s-tuple of level-1 features μ_1 = (μ_{1,1}, …, μ_{1,s}); we call this feature g_1(μ_1), with g_1 denoting the first-layer composition rule. As a result, the sum over μ_2 in the right-hand side of Eq. (C6) disappears and we are left with

N^{(1→L)}_{i_{2→L}}(μ_1, α) = m^{(s^L − 1)/(s−1) − L} Σ_{μ_3,…,μ_L=1}^{v} N^{(2)}_{i_2}[g_1(μ_1), μ_3] × ⋯ × N^{(L)}_{i_L}(μ_L, α).   (C7)

Coincidentally, Eq. (C7) shows that the joint occurrences of an s-tuple of low-level features μ_1 depend on the level-2 feature corresponding to μ_1. Hence, N^{(1→L)}_{i_{2→L}}(μ_1, α) is invariant for the exchange of μ_1 with one of its synonyms, i.e., level-1 tuples ν_1 corresponding to the same level-2 feature.

b. Class probability conditioned on low-level observations

We can turn these numbers into probabilities by normalizing them appropriately. Upon dividing by the total occurrences of a low-level feature μ_1 independently of the class, for instance, we obtain the conditional probability of the class of a given input, conditioned on the feature in position i_1…i_L being μ_1:

f^{(1→L)}_{i_{1→L}}(α|μ_1) ≔ N^{(1→L)}_{i_{1→L}}(μ_1, α) / Σ_{α′=1}^{n_c} N^{(1→L)}_{i_{1→L}}(μ_1, α′)
= [ Σ_{μ_2,…,μ_L=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, α) ] / [ Σ_{μ_2,…,μ_L=1}^{v} Σ_{μ_{L+1}=1}^{n_c} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, μ_{L+1}) ].   (C8)
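Equations (C6)–(C8) have a convenient matrix form: the sums over intermediate features are products of per-level, per-position occurrence matrices. The sketch below (ours) exploits this to compute f(α|μ_1) exactly for a sampled instance, reusing the rules structure of the generator sketch of Sec. II; the multiplicity prefactor of Eq. (C6) cancels in the ratio.

```python
import numpy as np

def occurrence_matrices(rules, v):
    """mats[l][i][low, high] = N^{(l+1)}_i(low, high) for the rules of one RHM instance."""
    mats = []
    for rule in rules:                       # rule: (n_high, m, s) array of tuples
        n_high, m, s = rule.shape
        N = np.zeros((s, v, n_high))
        for high in range(n_high):
            for i in range(s):
                # count how often each low-level feature appears in position i
                np.add.at(N[i, :, high], rule[high, :, i], 1.0)
        mats.append(N)
    return mats

def conditional_class_probability(mats, path):
    """Rows give f(alpha | mu_1) for the leaf position selected by path = (i_1, ..., i_L)."""
    U = mats[0][path[0]]
    for l, i in enumerate(path[1:], start=1):
        U = U @ mats[l][i]                   # sum over the level-(l+1) feature, as in Eq. (C6)
    return U / U.sum(axis=1, keepdims=True)  # normalization of Eq. (C8)

# each row deviates from the uniform value 1/n_c by the "signal" analyzed in this appendix
```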


Let us also introduce, for convenience, the numerator and denominator of the right-hand side of Eq. (C8):

U^{(1→L)}_{i_{1→L}}(μ_1, α) = Σ_{μ_2,…,μ_L=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, α),
D^{(1→L)}_{i_{1→L}}(μ_1) = Σ_{α=1}^{n_c} U^{(1→L)}_{i_{1→L}}(μ_1, α).   (C9)

c. Statistics of the numerator U

We now determine the first and second moments of the numerator of f^{(1→L)}_{i_{1→L}}(μ_1, α). Let us first recall the definition for clarity:

U^{(1→L)}_{i_{1→L}}(μ_1, α) = Σ_{μ_2,…,μ_L=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, α).   (C10)

(a) Level 1: L = 1. For L = 1, U is simply the occurrence of a single production rule N_i(μ_1, α),

⟨U^{(1)}⟩ = m/v,   (C11)

σ²_{U^{(1)}} ≔ Var[U^{(1)}] = (m/v) ((v − 1)/v) ((v^s − m)/(v^s − 1)) → m/v for v ≫ 1,   (C12)

c_{U^{(1)}} ≔ Cov[U^{(1)}(μ_1, α), U^{(1)}(ν_1, α)] = −Var[U^{(1)}]/(v − 1) = −(m/v)² (v^s − m)/(m(v^s − 1)) → −(1/m)(m/v)² for v ≫ 1,   (C13)

where the relationship between variance and covariance is due to the constraint on the sum of U^{(1)} over μ_1; see Eq. (B6).

(b) Level 2: L = 2. For L = 2,

U^{(1→2)}_{i_{1→2}}(μ_1, α) = Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × N^{(2)}_{i_2}(μ_2, α) = Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) U^{(2)}_{i_2}(μ_2, α).   (C14)

Therefore,

⟨U^{(1→2)}⟩ = v (m/v) ⟨U^{(1)}⟩ = v (m/v)²,   (C15)

σ²_{U^{(2)}} ≔ Var[U^{(1→2)}] = Σ_{μ_2,ν_2=1}^{v} [ ⟨N^{(1)}(μ_1, μ_2) N^{(1)}(μ_1, ν_2)⟩ ⟨U^{(2)}(μ_2, α) U^{(2)}(ν_2, α)⟩ − ⟨N⟩² ⟨U^{(1)}⟩² ]
= Σ_{μ_2, ν_2=μ_2} … + Σ_{μ_2} Σ_{ν_2≠μ_2} …
= v [ σ²_N σ²_{U^{(1)}} + σ²_N ⟨U^{(1)}⟩² + σ²_{U^{(1)}} ⟨N⟩² ] + v(v − 1) [ c_if c_{U^{(1)}} + c_if ⟨U^{(1)}⟩² + c_{U^{(1)}} ⟨N⟩² ]
= v (σ²_N σ²_{U^{(1)}} + (v − 1) c_if c_{U^{(1)}}) + v ⟨U^{(1)}⟩² (σ²_N + (v − 1) c_if) + v ⟨N⟩² (σ²_{U^{(1)}} + (v − 1) c_{U^{(1)}})
= v σ²_{U^{(1)}} (σ²_N − c_if) + v ⟨U^{(1)}⟩² (σ²_N + (v − 1) c_if),   (C16)

c_{U^{(2)}} = −σ²_{U^{(2)}}/(v − 1).   (C17)


(c) Level L. In general,

U^{(1→L)}_{i_{1→L}}(μ_1, α) = Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) U^{(2→L)}_{i_{2→L}}(μ_2, α).   (C18)

Therefore,

⟨U^{(L)}⟩ = v (m/v) ⟨U^{(L−1)}⟩ = v^{L−1} (m/v)^L,   (C19)

σ²_{U^{(L)}} = Σ_{μ_2,ν_2=1}^{v} [ ⟨N^{(1)}(μ_1, μ_2) N^{(1)}(μ_1, ν_2)⟩ ⟨U^{(2→L)}(μ_2, α) U^{(2→L)}(ν_2, α)⟩ − ⟨N⟩² ⟨U^{(1→(L−1))}⟩² ]
= Σ_{μ_2, ν_2=μ_2} … + Σ_{μ_2} Σ_{ν_2≠μ_2} …
= v [ σ²_N σ²_{U^{(L−1)}} + σ²_N ⟨U^{(L−1)}⟩² + σ²_{U^{(L−1)}} ⟨N⟩² ] + v(v − 1) [ c_if c_{U^{(L−1)}} + c_if ⟨U^{(L−1)}⟩² + c_{U^{(L−1)}} ⟨N⟩² ]
= v σ²_{U^{(L−1)}} (σ²_N − c_if) + v ⟨U^{(L−1)}⟩² (σ²_N + (v − 1) c_if),   (C20)

c_{U^{(L)}} = −σ²_{U^{(L)}}/(v − 1).   (C21)

(d) Concentration for large m. In the large multiplicity limit m ≫ 1, the U's concentrate around their mean value. Because of m ≤ v^{s−1}, large m implies large v; thus, we can proceed by setting m = q v^{s−1}, with q ∈ (0, 1], and studying the v ≫ 1 limit. From Eq. (C19),

⟨U^{(L)}⟩ = q^L v^{L(s−1)−1}.   (C22)

In addition,

σ²_N → m/v = q v^{(s−1)−1},   c_if → −(m/v)² (1/v^{s−1}) = −q² v^{(s−1)−2}   for v ≫ 1,   (C23)

so that

σ²_{U^{(L)}} = v σ²_{U^{(L−1)}} (σ²_N − c_if) + v ⟨U^{(L−1)}⟩² (σ²_N + (v − 1) c_if)
→ σ²_{U^{(L−1)}} q v^{(s−1)} + σ²_{U^{(L−1)}} q² v^{(s−1)−1} + q^{2L−1}(1 − q) v^{(2L−1)(s−1)−2}   for v ≫ 1.   (C24)

The second of the three terms is always subleading with respect to the first, so we can discard it for now. It remains to compare the first and the third terms. For L = 2, since σ²_{U^{(1)}} = σ²_N, the first term depends on v as v^{2(s−1)−1}, whereas the third is proportional to v^{3(s−1)−2}. For L ≥ 3, the dominant scaling is that of the third term only: for L = 3 it can be shown by simply plugging the L = 2 result into the recursion, and for larger L it follows from the fact that replacing σ²_{U^{(L−1)}} in the first term with the third term of the previous step always yields a subdominant contribution. Therefore,

σ²_{U^{(L)}} → q² v^{2(s−1)−1} + q³(1 − q) v^{3(s−1)−2}   for L = 2,
σ²_{U^{(L)}} → q^{2L−1}(1 − q) v^{(2L−1)(s−1)−2}   for L ≥ 3,   as v ≫ 1.   (C25)


Upon dividing the variance by the squared mean, we get


 1 1 þ 1−q 1 for L ¼ 2
σ 2UðLÞ q2 v2ðs−1Þ−1
v≫1 q vðs−1Þ
ðLÞ 2
⟶ ðC26Þ
hU i 1−q 1
ðs−1Þ for L ≥ 3;
q v

whose convergence to 0 guarantees the concentration of the U’s around the average over all instances of the RHM.

d. Statistics of the denominator D

Here we compute the first and second moments of the denominator of $f^{(1\to L)}_{i_{1\to L}}(\alpha|\mu_1)$:

$$D^{(1\to L)}_{i_{1\to L}}(\mu_1) = \sum_{\mu_2,\dots,\mu_L=1}^{v}\sum_{\mu_{L+1}=1}^{n_c} N^{(1)}_{i_1}(\mu_1,\mu_2)\times\dots\times N^{(L)}_{i_L}(\mu_L,\mu_{L+1}). \quad (C27)$$

(a) Level 1: $L=1$. For $L=1$, $D$ is simply the sum over classes of the occurrences of a single production rule, $D^{(1)} = \sum_\alpha N_i(\mu_1,\alpha)$,

$$\langle D^{(1)}\rangle = n_c\frac{m}{v}, \quad (C28)$$

$$\sigma^2_{D^{(1)}} := \mathrm{Var}\bigl[D^{(1)}\bigr] = n_c\sigma^2_N + n_c(n_c-1)c_{if} = n_c\left(\frac{m}{v}\right)^2\frac{v-1}{v^s-1}\left(\frac{v^s}{m}-n_c\right) \xrightarrow{v\gg1} n_c\left(\frac{m}{v}\right)^2\left(\frac{v}{m}-\frac{n_c}{v^{s-1}}\right), \quad (C29)$$

$$c_{D^{(1)}} := \mathrm{Cov}\bigl[D^{(1)}(\mu_1),\,D^{(1)}(\nu_1)\bigr] = -\frac{\mathrm{Var}\bigl[D^{(1)}\bigr]}{(v-1)} = n_c c_N + n_c(n_c-1)c_g, \quad (C30)$$

where, in the last line, we used the identities $\sigma^2_N + (v-1)c_N = 0$ from Eq. (B5) and $c_{if} + (v-1)c_g = 0$ from Eq. (B10).
(b) Level 2: $L=2$. For $L=2$,

$$D^{(1\to 2)}_{i_{1\to 2}}(\mu_1) = \sum_{\mu_2=1}^{v}\sum_{\mu_3=1}^{n_c} N^{(1)}_{i_1}(\mu_1,\mu_2)\times N^{(2)}_{i_2}(\mu_2,\mu_3) = \sum_{\mu_2=1}^{v} N^{(1)}_{i_1}(\mu_1,\mu_2)\,D^{(2)}_{i_2}(\mu_2). \quad (C31)$$

Therefore,

$$\langle D^{(1\to 2)}\rangle = v\,\frac{m}{v}\times\langle D^{(1)}\rangle = \frac{n_c}{v}m^2, \quad (C32)$$

$$\begin{aligned}
\sigma^2_{D^{(2)}} := \mathrm{Var}\bigl[D^{(1\to 2)}\bigr] &= \sum_{\mu_2,\nu_2=1}^{v}\bigl\langle N^{(1)}(\mu_1,\mu_2)N^{(1)}(\mu_1,\nu_2)\bigr\rangle\bigl\langle D^{(2)}(\mu_2)D^{(2)}(\nu_2)\bigr\rangle - \bigl(v\langle N\rangle\langle D^{(1)}\rangle\bigr)^2 \\
&= \sum_{\mu_2=\nu_2}\cdots + \sum_{\mu_2}\sum_{\nu_2\neq\mu_2}\cdots \\
&= v\Bigl(\sigma^2_N\sigma^2_{D^{(1)}} + \sigma^2_N\langle D^{(1)}\rangle^2 + \sigma^2_{D^{(1)}}\langle N\rangle^2\Bigr) + v(v-1)\Bigl(c_{if}c_{D^{(1)}} + c_{if}\langle D^{(1)}\rangle^2 + c_{D^{(1)}}\langle N\rangle^2\Bigr) \\
&= v\Bigl(\sigma^2_N\sigma^2_{D^{(1)}} + (v-1)c_{if}c_{D^{(1)}}\Bigr) + v\langle D^{(1)}\rangle^2\Bigl(\sigma^2_N + (v-1)c_{if}\Bigr) + v\langle N\rangle^2\Bigl(\sigma^2_{D^{(1)}} + (v-1)c_{D^{(1)}}\Bigr) \\
&= v\sigma^2_{D^{(1)}}\Bigl(\sigma^2_N - c_{if}\Bigr) + v\langle D^{(1)}\rangle^2\Bigl(\sigma^2_N + (v-1)c_{if}\Bigr),
\end{aligned} \quad (C33)$$

$$c_{D^{(2)}} = -\frac{\sigma^2_{D^{(2)}}}{(v-1)}. \quad (C34)$$


(c) Level $L$. In general,

$$D^{(1\to L)}_{i_{1\to L}}(\mu_1) = \sum_{\mu_2=1}^{v} N^{(1)}_{i_1}(\mu_1,\mu_2)\,D^{(2\to L)}_{i_{2\to L}}(\mu_2). \quad (C35)$$

Therefore,

$$\langle D^{(L)}\rangle = v\,\frac{m}{v}\times\langle D^{(L-1)}\rangle = \frac{n_c}{v}m^L, \quad (C36)$$

$$\begin{aligned}
\sigma^2_{D^{(L)}} &= \sum_{\mu_2,\nu_2=1}^{v}\bigl\langle N^{(1)}(\mu_1,\mu_2)N^{(1)}(\mu_1,\nu_2)\bigr\rangle\bigl\langle D^{(2\to L)}(\mu_2)D^{(2\to L)}(\nu_2)\bigr\rangle - \bigl(v\langle N\rangle\langle D^{(1\to(L-1))}\rangle\bigr)^2 \\
&= \sum_{\mu_2=\nu_2}\cdots + \sum_{\mu_2}\sum_{\nu_2\neq\mu_2}\cdots \\
&= v\Bigl(\sigma^2_N\sigma^2_{D^{(L-1)}} + \sigma^2_N\langle D^{(L-1)}\rangle^2 + \sigma^2_{D^{(L-1)}}\langle N\rangle^2\Bigr) + v(v-1)\Bigl(c_{if}c_{D^{(L-1)}} + c_{if}\langle D^{(L-1)}\rangle^2 + c_{D^{(L-1)}}\langle N\rangle^2\Bigr) \\
&= v\sigma^2_{D^{(L-1)}}\Bigl(\sigma^2_N - c_{if}\Bigr) + v\langle D^{(L-1)}\rangle^2\Bigl(\sigma^2_N + (v-1)c_{if}\Bigr),
\end{aligned} \quad (C37)$$

$$c_{D^{(L)}} = -\frac{\sigma^2_{D^{(L)}}}{(v-1)}. \quad (C38)$$

(d) Concentration for large m. Since the D’s can be expressed as a sum of different U’s, their concentration for m ≫ 1
follows directly from that of the U’s.

e. Estimate of the conditional class probability

We can now turn back to the original problem of estimating

$$f^{(1\to L)}_{i_{1\to L}}(\alpha|\mu_1) = \frac{\sum_{\mu_2,\dots,\mu_L=1}^{v} N^{(1)}_{i_1}(\mu_1,\mu_2)\times\dots\times N^{(L)}_{i_L}(\mu_L,\alpha)}{\sum_{\mu_2,\dots,\mu_L=1}^{v}\sum_{\mu_{L+1}=1}^{n_c} N^{(1)}_{i_1}(\mu_1,\mu_2)\times\dots\times N^{(L)}_{i_L}(\mu_L,\mu_{L+1})} = \frac{U^{(1\to L)}_{i_{1\to L}}(\mu_1,\alpha)}{D^{(1\to L)}_{i_{1\to L}}(\mu_1)}. \quad (C39)$$

Having shown that both numerator and denominator converge to their average for large $m$, we can expand for small fluctuations around these averages and write

$$f^{(1\to L)}_{i_{1\to L}}(\alpha|\mu_1) = \frac{v^{-1}m^L\left(1 + \dfrac{U^{(1\to L)}_{i_{1\to L}}(\mu_1,\alpha) - m^L/v}{m^L/v}\right)}{n_c v^{-1} m^L\left(1 + \dfrac{D^{(1\to L)}_{i_{1\to L}}(\mu_1) - n_c m^L/v}{n_c m^L/v}\right)} \quad (C40)$$

$$= \frac{1}{n_c} + \frac{1}{n_c}\frac{U^{(1\to L)}_{i_{1\to L}}(\mu_1,\alpha) - m^L/v}{m^L/v} - \frac{1}{n_c}\frac{D^{(1\to L)}_{i_{1\to L}}(\mu_1) - n_c m^L/v}{n_c m^L/v} = \frac{1}{n_c} + \frac{v}{n_c m^L}\left(U^{(1\to L)}_{i_{1\to L}}(\mu_1,\alpha) - \frac{1}{n_c}D^{(1\to L)}_{i_{1\to L}}(\mu_1)\right). \quad (C41)$$

Since the conditional frequencies average to $n_c^{-1}$, the term in brackets averages to zero. We can then estimate the size of the fluctuations of the conditional frequencies (i.e., the signal) with the standard deviation of the term in brackets.
It is important to notice that, for each $L$ and position $i_{1\to L}$, $D$ is the sum over $\alpha$ of $U$, and the $U$'s with different $\alpha$ at fixed low-level feature $\mu_1$ are identically distributed. In general, for a sequence of identically distributed variables $(X_\alpha)_{\alpha=1,\dots,n_c}$,

$$\left\langle\left(\frac{1}{n_c}\sum_{\beta=1}^{n_c}X_\beta\right)^2\right\rangle = \frac{1}{n_c^2}\sum_{\beta=1}^{n_c}\left(\langle X_\beta^2\rangle + \sum_{\beta'\neq\beta}\langle X_\beta X_{\beta'}\rangle\right) = \frac{1}{n_c}\left(\langle X_\beta^2\rangle + \sum_{\beta'\neq\beta}\langle X_\beta X_{\beta'}\rangle\right). \quad (C42)$$

Hence,

$$\begin{aligned}
\left\langle\left(X_\alpha - \frac{1}{n_c}\sum_{\beta=1}^{n_c}X_\beta\right)^2\right\rangle &= \langle X_\alpha^2\rangle + n_c^{-2}\sum_{\beta,\gamma=1}^{n_c}\langle X_\beta X_\gamma\rangle - 2n_c^{-1}\sum_{\beta=1}^{n_c}\langle X_\alpha X_\beta\rangle \\
&= \langle X_\alpha^2\rangle - n_c^{-1}\left(\langle X_\alpha^2\rangle + \sum_{\beta\neq\alpha}\langle X_\alpha X_\beta\rangle\right) \\
&= \langle X_\alpha^2\rangle - n_c^{-2}\left\langle\left(\sum_{\beta=1}^{n_c}X_\beta\right)^2\right\rangle.
\end{aligned} \quad (C43)$$

In our case,

$$\left\langle\left(U^{(1\to L)}_{i_{1\to L}}(\mu_1,\alpha) - \frac{1}{n_c}D^{(1\to L)}_{i_{1\to L}}(\mu_1)\right)^2\right\rangle = \left\langle\left(U^{(1\to L)}_{i_{1\to L}}(\mu_1,\alpha)\right)^2\right\rangle - n_c^{-2}\left\langle\left(D^{(1\to L)}_{i_{1\to L}}(\mu_1)\right)^2\right\rangle = \sigma^2_{U^{(L)}} - n_c^{-2}\sigma^2_{D^{(L)}}, \quad (C44)$$

where, in the second line, we have used that $\langle U^{(L)}\rangle = \langle D^{(L)}\rangle/n_c$ to convert the difference of second moments into a difference of variances. By Eqs. (C19) and (C36),

$$\begin{aligned}
\sigma^2_{U^{(L)}} - n_c^{-2}\sigma^2_{D^{(L)}} &= v\sigma^2_{U^{(L-1)}}\bigl(\sigma^2_N - c_{if}\bigr) + v\langle U^{(L-1)}\rangle^2\bigl(\sigma^2_N + (v-1)c_{if}\bigr) \\
&\quad - \frac{v}{n_c^2}\sigma^2_{D^{(L-1)}}\bigl(\sigma^2_N - c_{if}\bigr) - \frac{v}{n_c^2}\langle D^{(L-1)}\rangle^2\bigl(\sigma^2_N + (v-1)c_{if}\bigr) \\
&= v\bigl(\sigma^2_N - c_{if}\bigr)\bigl(\sigma^2_{U^{(L-1)}} - n_c^{-2}\sigma^2_{D^{(L-1)}}\bigr),
\end{aligned} \quad (C45)$$

having used again that $\langle U^{(L)}\rangle = \langle D^{(L)}\rangle/n_c$. Iterating,

$$\sigma^2_{U^{(L)}} - n_c^{-2}\sigma^2_{D^{(L)}} = \bigl[v\bigl(\sigma^2_N - c_{if}\bigr)\bigr]^{L-1}\bigl(\sigma^2_{U^{(1)}} - n_c^{-2}\sigma^2_{D^{(1)}}\bigr). \quad (C46)$$

Since

$$\sigma^2_{U^{(1)}} = \frac{m}{v}\frac{v-1}{v}\frac{v^s-m}{v^s-1} \xrightarrow{v\gg1} \frac{m}{v}, \qquad n_c^{-2}\sigma^2_{D^{(1)}} = n_c^{-1}\sigma^2_N + n_c^{-1}(n_c-1)c_{if} \xrightarrow{v\gg1} n_c^{-1}\left(\frac{m}{v}\right)^2\left(\frac{v}{m} - \frac{n_c}{v^{s-1}}\right) = \frac{1}{n_c}\frac{m}{v}\left(1 - \frac{m n_c}{v^s}\right), \quad (C47)$$

one has

$$\sigma^2_{U^{(L)}} - n_c^{-2}\sigma^2_{D^{(L)}} \xrightarrow{v\gg1} \frac{m^L}{v}\left(1 - \frac{1 - n_c m/v^s}{n_c}\right), \quad (C48)$$

so that

$$\mathrm{Var}\bigl[f^{(1\to L)}_{i_{1\to L}}(\alpha|\mu_1)\bigr] = \frac{v^2}{n_c^2 m^{2L}}\left\langle\left(U^{(1\to L)}_{i_{1\to L}}(\mu_1,\alpha) - \frac{1}{n_c}D^{(1\to L)}_{i_{1\to L}}(\mu_1)\right)^2\right\rangle \xrightarrow{v,\,n_c\gg1} \frac{v}{n_c}\frac{1}{n_c m^L}. \quad (C49)$$
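For a concrete sense of scale, take for instance $v = n_c = 32$, $m = 8$, and $L = 2$ (illustrative values, not taken from the experiments):

$$\mathrm{Var}\bigl[f\bigr] \simeq \frac{v}{n_c^2 m^L} = \frac{32}{32^2\times 8^2} \approx 4.9\times 10^{-4},$$

so the conditional frequencies fluctuate by about $0.022$ around their mean $n_c^{-1}\approx 0.031$, corresponding to a relative signal $\sigma_f = \sqrt{v/m^L} = \sqrt{1/2}\approx 0.71$.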

2. Introducing sampling noise due to the finite training set

In a supervised learning setting where only $P$ of the total data are available, the occurrences $N$ are replaced with their empirical counterparts $\hat N$. In particular, the empirical joint occurrence $\hat N(\mu,\alpha)$ (where we dropped level and positional indices to ease notation) coincides with the number of successes when sampling $P$ points without replacement from a population of $P_{\max}$ where only $N(\mu,\alpha)$ belong to class $\alpha$ and display feature $\mu$ in position $j$. Thus, $\hat N(\mu,\alpha)$ obeys a hypergeometric distribution where $P$ plays the role of the number of trials, $P_{\max}$ the population size, and the true occurrence


$N(\mu,\alpha)$ the number of favorable cases. If $P$ is large and $P_{\max}$, $N(\mu,\alpha)$ are both larger than $P$, then

$$\hat N(\mu,\alpha) \to \mathcal{N}\left(P\frac{N(\mu,\alpha)}{P_{\max}},\; P\frac{N(\mu,\alpha)}{P_{\max}}\left(1 - \frac{N(\mu,\alpha)}{P_{\max}}\right)\right), \quad (C50)$$

where the convergence is meant as a convergence in probability and $\mathcal{N}(a,b)$ denotes a Gaussian distribution with mean $a$ and variance $b$. The statement above holds when the ratio $N(\mu,\alpha)/P_{\max}$ is away from 0 and 1, which is true with probability 1 for large $v$ due to the concentration of $f(\alpha|\mu)$. In complete analogy, the empirical occurrence $\hat N(\mu)$ obeys

$$\hat N(\mu) \to \mathcal{N}\left(P\frac{N(\mu)}{P_{\max}},\; P\frac{N(\mu)}{P_{\max}}\left(1 - \frac{N(\mu)}{P_{\max}}\right)\right). \quad (C51)$$

We obtain the empirical conditional frequency by the ratio of Eqs. (C50) and (C51). Since $N(\mu) = P_{\max}/v$ and $f(\alpha|\mu) = N(\mu,\alpha)/N(\mu)$, we have

$$\hat f(\alpha|\mu) = \frac{\dfrac{f(\alpha|\mu)}{v} + \xi_P\sqrt{\dfrac{1}{P}\dfrac{f(\alpha|\mu)}{v}\left(1 - \dfrac{f(\alpha|\mu)}{v}\right)}}{\dfrac{1}{v} + \zeta_P\sqrt{\dfrac{1}{P}\dfrac{1}{v}\left(1 - \dfrac{1}{v}\right)}}, \quad (C52)$$

where $\xi_P$ and $\zeta_P$ are correlated zero-mean and unit-variance Gaussian random variables over independent drawings of the $P$ training points. By expanding the denominator of the right-hand side for large $P$ we get, after some algebra,

$$\hat f(\alpha|\mu) \simeq f(\alpha|\mu) + \xi_P\sqrt{\frac{v f(\alpha|\mu)}{P}\left(1 - \frac{f(\alpha|\mu)}{v}\right)} - \zeta_P\, f(\alpha|\mu)\sqrt{\frac{v}{P}\left(1 - \frac{1}{v}\right)}. \quad (C53)$$

Recall that, in the limit of large $n_c$ and $m$, $f(\alpha|\mu) = n_c^{-1}(1 + \sigma_f\,\xi_{\mathrm{RHM}})$, where $\xi_{\mathrm{RHM}}$ is a zero-mean and unit-variance Gaussian variable over the realizations of the RHM, while $\sigma_f$ is the signal, $\sigma_f^2 = v/m^L$ by Eq. (C49). As a result,

$$\hat f(\alpha|\mu) \xrightarrow{n_c,m,P\gg 1} \frac{1}{n_c}\left(1 + \sqrt{\frac{v}{m^L}}\,\xi_{\mathrm{RHM}} + \sqrt{\frac{v n_c}{P}}\,\xi_P\right). \quad (C54)$$
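The size of the sampling term in Eq. (C54) can be checked with the short simulation below, which is not part of the paper's pipeline: it assumes a perfectly flat conditional frequency $f(\alpha|\mu) = 1/n_c$, builds an explicit population of $P_{\max}$ points and draws $P$ of them without replacement; the values of v, n_c, P_max, P and the number of resamplings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
v, n_c, P, n_resamples = 32, 32, 10_000, 500
P_max = v * n_c * 200                         # 200 points per (mu, alpha) pair

# population: low-level feature mu (v values) and class alpha (n_c values) per point
mu_pop = np.repeat(np.arange(v), P_max // v)
alpha_pop = np.tile(np.repeat(np.arange(n_c), P_max // (v * n_c)), v)

f_hat = np.empty(n_resamples)
for t in range(n_resamples):
    idx = rng.choice(P_max, size=P, replace=False)         # the P training points
    sel_mu, sel_alpha = mu_pop[idx], alpha_pop[idx]
    N_mu = np.sum(sel_mu == 0)                             # empirical N^hat(mu)
    N_mu_alpha = np.sum((sel_mu == 0) & (sel_alpha == 0))  # empirical N^hat(mu, alpha)
    f_hat[t] = N_mu_alpha / N_mu

# predicted sampling noise from Eq. (C54): (1/n_c) * sqrt(v n_c / P), up to a small
# finite-population correction because P is not negligible compared to P_max
print("measured std :", f_hat.std())
print("predicted    :", np.sqrt(v / (n_c * P)))
```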
3. Sample complexity

From Eq. (C54) it is clear that, for the signal in $\hat f$, the fluctuations due to noise must be smaller than those due to the random choice of the composition rules. Therefore, the crossover takes place when the two noise terms have the same size, occurring at $P = P_c$ such that

$$\sqrt{\frac{v}{m^L}} = \sqrt{\frac{v n_c}{P_c}} \;\Rightarrow\; P_c = n_c m^L. \quad (C55)$$

APPENDIX D: IMPROVED SAMPLE COMPLEXITY VIA CLUSTERING

In this appendix, we consider the maximal dataset case $n_c = v$ and $m = v^{s-1}$, and show that a distance-based clustering method acting on the hidden representations of Eq. (13) would identify synonyms at $P \simeq \sqrt{n_c}\, m^L$. Let us then imagine feeding the representation updates $\Delta f_h(\mu)$ of Eq. (13) to a clustering algorithm aimed at identifying synonyms. This algorithm is based on the distance between the representations of different tuples of input features $\mu$ and $\nu$,

$$\|\Delta f(\mu) - \Delta f(\nu)\|^2 := \frac{1}{H}\sum_{h=1}^{H}\bigl(\Delta f_h(\mu) - \Delta f_h(\nu)\bigr)^2, \quad (D1)$$

where $H$ is the number of hidden neurons. By defining

$$\hat g_\alpha(\mu) := \frac{\hat N_1(\mu,\alpha)}{P} - \frac{1}{n_c}\frac{\hat N_1(\mu)}{P}, \quad (D2)$$

and denoting with $\hat g(\mu)$ the $n_c$-dimensional sequence having the $\hat g_\alpha$'s as components, we have

$$\|\Delta f(\mu) - \Delta f(\nu)\|^2 = \sum_{\alpha,\beta=1}^{n_c}\left(\frac{1}{H}\sum_{h=1}^{H} a_{h,\alpha}a_{h,\beta}\right)\bigl(\hat g_\alpha(\mu) - \hat g_\alpha(\nu)\bigr)\bigl(\hat g_\beta(\mu) - \hat g_\beta(\nu)\bigr) \xrightarrow{H\to\infty} \sum_{\alpha=1}^{n_c}\bigl(\hat g_\alpha(\mu) - \hat g_\alpha(\nu)\bigr)^2 = \|\hat g(\mu) - \hat g(\nu)\|^2, \quad (D3)$$

where we used the i.i.d. Gaussian initialization of the readout weights to replace the sum over neurons with $\delta_{\alpha,\beta}$.
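A minimal numerical check of the limit in Eq. (D3), assuming i.i.d. standard Gaussian readout weights $a_{h,\alpha}$ as stated above; the $\hat g$ vectors here are random placeholders rather than actual RHM representations:

```python
import numpy as np

rng = np.random.default_rng(2)
n_c = 16
g_mu, g_nu = rng.standard_normal(n_c), rng.standard_normal(n_c)   # placeholder g-hat vectors
target = np.sum((g_mu - g_nu) ** 2)                                # ||g(mu) - g(nu)||^2

for H in (10, 1_000, 100_000):
    a = rng.standard_normal((H, n_c))           # readout weights a_{h, alpha}
    proj = a @ (g_mu - g_nu)                    # sum_alpha a_{h,alpha} (g_alpha(mu) - g_alpha(nu))
    print(H, (proj ** 2).mean(), "->", target)  # left-hand side of Eq. (D3) vs its H -> infinity limit
```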
Because of the sampling noise, from Eqs. (C50) and (C51), when $1 \ll P \ll P_{\max}$,

$$\hat g_\alpha(\mu) = g_\alpha(\mu) + \sqrt{\frac{1}{n_c m v P}}\,\eta_\alpha(\mu), \quad (D4)$$


where $\eta_\alpha(\mu)$ is a zero-mean and unit-variance Gaussian noise and $g$ without the hat denotes the $P \to P_{\max}$ limit of $\hat g$. In the limit $1 \ll P \ll P_{\max}$, the noises with different $\alpha$ and $\mu$ are independent of each other. Thus,

$$\|\hat g(\mu) - \hat g(\nu)\|^2 = \|g(\mu) - g(\nu)\|^2 + \frac{1}{n_c m v P}\|\eta(\mu) - \eta(\nu)\|^2 + \frac{2}{\sqrt{n_c m v P}}\bigl(g(\mu) - g(\nu)\bigr)\cdot\bigl(\eta(\mu) - \eta(\nu)\bigr). \quad (D5)$$

If $\mu$ and $\nu$ are synonyms, then $g(\mu) = g(\nu)$ and only the noise term contributes to the right-hand side of Eq. (D5). If this noise is sufficiently small, then the distance above can be used to cluster tuples into synonymic groups.

By the independence of the noises and the central limit theorem, for $n_c \gg 1$,

$$\|\eta(\mu) - \eta(\nu)\|^2 \sim \mathcal{N}\bigl(2n_c,\, O(\sqrt{n_c})\bigr), \quad (D6)$$

over independent samplings of the $P$ training points. The $g$'s are also random variables over independent realizations of the RHM, with zero mean and variance proportional to the variance of the conditional probabilities $f(\alpha|\mu)$ [see Eqs. (C40) and (C49)]:

$$\mathrm{Var}\bigl[g_\alpha(\mu)\bigr] = \frac{1}{n_c m v\, n_c m^L} = \frac{1}{n_c m v P_c}. \quad (D7)$$

To estimate the size of $\|g(\mu) - g(\nu)\|^2$ we must take into account the correlations (over RHM realizations) between $g$'s with different class labels and tuples. However, in the maximal dataset case $n_c = v$ and $m = v^{s-1}$, both the sum over classes and the sum over tuples of input features of the joint occurrences $N(\mu,\alpha)$ are fixed deterministically. The constraints on the sums allow us to control the correlations between occurrences of the same tuple within different classes and of different tuples within the same class, so that the size of the term $\|g(\mu) - g(\nu)\|^2$ for $n_c = v \gg 1$ can be estimated via the central limit theorem:

$$\|g(\mu) - g(\nu)\|^2 \sim \mathcal{N}\left(\frac{2n_c}{n_c m v P_c},\, \frac{O(\sqrt{n_c})}{n_c m v P_c}\right). \quad (D8)$$

The mixed term $(g(\mu) - g(\nu))\cdot(\eta(\mu) - \eta(\nu))$ has zero average (both with respect to training set sampling and RHM realizations) and can also be shown to lead to relative fluctuations of order $O(\sqrt{n_c})$ in the maximal dataset case. To summarize, we have that, for synonyms,

$$\|\hat g(\mu) - \hat g(\nu)\|^2 = \frac{\|\eta(\mu) - \eta(\nu)\|^2}{n_c m v P} \sim \frac{1}{m v P}\left(1 + \frac{1}{\sqrt{n_c}}\xi_P\right), \quad (D9)$$

where $\xi_P$ is some $O(1)$ noise dependent on the training set sampling. If $\mu$ and $\nu$ are not synonyms, instead,

$$\|\hat g(\mu) - \hat g(\nu)\|^2 \sim \frac{1}{m v P}\left(1 + \frac{1}{\sqrt{n_c}}\xi_P\right) + \frac{1}{m v P_c}\left(1 + \frac{1}{\sqrt{n_c}}\xi_{\mathrm{RHM}}\right), \quad (D10)$$

where $\xi_{\mathrm{RHM}}$ is some $O(1)$ noise dependent on the RHM realization. In this setting, the signal is the deterministic part of the difference between representations of nonsynonymic tuples. Because of the sum over class labels, the signal is scaled up by a factor $n_c$, whereas the fluctuations (stemming from both sampling and model) are only increased by $O(\sqrt{n_c})$. Therefore, the signal required for clustering emerges from the sampling noise at $P = P_c/\sqrt{n_c} = \sqrt{n_c}\,m^L$, equal to $v^{1/2+L(s-1)}$ in the maximal dataset case. This prediction is tested for $s = 2$ in Fig. 10, which shows the error achieved by a layerwise


algorithm which alternates single GD steps to clustering of the resulting representations [22,62]. More specifically, the weights of the first hidden layer are updated with a single GD step while keeping all the other weights frozen. The resulting representations are then clustered, so as to identify groups of synonymic level-1 tuples. The centroids of the ensuing clusters, which correspond to level-2 features, are orthogonalized and used as inputs of another one-step GD protocol, which aims at identifying synonymic tuples of level-2 features. The procedure is iterated L times.

FIG. 10. Sample complexity for layerwise training, m = n_c = v, L = 3, s = 2. Training of an L-layers network is performed layerwise by alternating one-step GD as described in Sec. IV C and clustering of the hidden representations. Clustering of the mv = v^2 representations for the different one-hot-encoded input patches is performed with the k-means algorithm. Clustered representations are then orthogonalized and the result is given to the next one-step GD procedure. Left: test error versus number of training points. Different colors correspond to different values of v. Center: collapse of the test error curves when rescaling the x axis by v^{L+1/2}. Right: analogous, when rescaling the x axis by v^{L+1}. The curves show a better collapse when rescaling by v^{L+1/2}, suggesting that these layerwise algorithms have an advantage of a factor sqrt(v) over end-to-end training with deep CNNs, for which P = v^{L+1}.
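The clustering stage of this layerwise procedure can be sketched as follows. The one-step GD update itself is omitted, and the representation updates are emulated with placeholder vectors whose synonym structure is known; scikit-learn's KMeans stands in for the k-means step, and the values of v, m, n_c are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(3)
v, m, n_c = 8, 4, 8                       # v level-2 features, m synonymic tuples each
true_group = np.repeat(np.arange(v), m)   # level-2 feature encoded by each of the m*v tuples

# placeholder representation updates: synonymic tuples share a common vector plus noise
centers = rng.standard_normal((v, n_c))
reps = centers[true_group] + 0.1 * rng.standard_normal((m * v, n_c))

labels = KMeans(n_clusters=v, n_init=10, random_state=0).fit_predict(reps)
print("clusters match synonym groups:", adjusted_rand_score(true_group, labels) == 1.0)

# cluster centroids play the role of level-2 features; orthogonalize them to build
# the inputs of the next one-step GD protocol
centroids = np.stack([reps[labels == k].mean(axis=0) for k in range(v)])
next_inputs, _ = np.linalg.qr(centroids.T)   # orthonormal encodings, one column per cluster
```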

APPENDIX E: INTRINSIC DIMENSIONALITY OF DATA REPRESENTATIONS

In deep learning, the representation of data at each layer of a network can be thought of as lying on a manifold in the layer's activation space. Measures of the intrinsic dimensionality of these manifolds can provide insights into how the networks lower the dimensionality of the problem layer by layer. However, such measurements have challenges. One key challenge is that they assume that real data exist on a smooth manifold, while in practice the dimensionality is estimated based on a discrete set of points. This leads to counterintuitive results such as an increase in the intrinsic dimensionality with depth, especially near the input, an effect that is impossible for continuous smooth manifolds. We resort to an example to illustrate how this increase with depth can result from spurious effects. Consider a manifold of a given intrinsic dimension that undergoes a transformation where one of the coordinates is multiplied by a large factor. This operation would result in an elongated manifold that appears one dimensional. The measured intrinsic dimensionality would consequently be one, despite the higher dimensionality of the manifold. In the context of neural networks, a network that operates on such an elongated manifold could effectively "reduce" this extra, spurious dimension. This could result in an increase in the observed intrinsic dimensionality as a function of network depth, even though the actual dimensionality of the manifold did not change.

In the specific case of our data, the intrinsic dimensionality of the internal representations of deep CNNs monotonically decreases with depth, see Fig. 11, consistently with the idea proposed in the main text that the CNNs solve the problem by reducing the effective dimensionality of data layer by layer. We attribute this monotonicity to the absence of spurious or noisy directions that might lead to the counterintuitive effect described above.

FIG. 11. Effective dimension of the internal representation of a CNN trained on one instance of the RHM with m = n_c = v and L = 3, resulting in P_max = 6232. Left: average nearest-neighbor distance of input or network activations when probing them with a dataset of size P. The value reported on the y axis is normalized by delta_0 = delta(P = 10). The slope of delta(P) is used as an estimate of the effective dimension. Right: effective dimension as a function of depth. We observe a monotonic decrease, consistent with the idea that the dimensionality of the problem is reduced by deep neural networks with depth.
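The estimator used in Fig. 11 can be sketched as follows: the average nearest-neighbor distance of P points lying on a d-dimensional manifold scales as delta(P) ~ P^(-1/d), so d is read off the slope of log delta versus log P. The snippet below is illustrative only, using synthetic data on a flat manifold rather than network activations, and assumes scipy's cKDTree for the neighbor search.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(4)
d_true, ambient = 4, 20
basis = rng.standard_normal((d_true, ambient))      # flat d_true-dimensional manifold in R^ambient

def mean_nn_distance(P):
    pts = rng.standard_normal((P, d_true)) @ basis   # P points on the manifold
    dist, _ = cKDTree(pts).query(pts, k=2)           # k = 2: nearest neighbor besides the point itself
    return dist[:, 1].mean()

Ps = np.array([200, 400, 800, 1600, 3200, 6400])
deltas = np.array([mean_nn_distance(P) for P in Ps])
slope = np.polyfit(np.log(Ps), np.log(deltas), 1)[0]
print("estimated effective dimension:", -1 / slope, "(true:", d_true, ")")
```

The estimate roughly recovers the true dimension; applied to the activations of a trained network, the same procedure yields the depthwise decrease shown in Fig. 11.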

APPENDIX F: ADDITIONAL RESULTS ON SAMPLE COMPLEXITY

This appendix collects additional results on the sample complexity of deep networks trained on the RHM (Figs. 12 and 13), on the learning curves for "lazy" neural networks (Fig. 14), and for a ResNet18 trained on different subsamples of the benchmark dataset CIFAR10 (Fig. 15).

Figure 12 shows the behavior of the sample complexity with varying number of classes n_c when all the other parameters of the RHM are fixed, confirming the linear scaling discussed in the main text.

Figure 13 shows the behavior of the sample complexity for deep fully connected networks having depth larger than L + 1, which are not tailored to the structure of the RHM. Notice that changing architecture seems to induce an additional factor of 2^L in the sample complexity, independent of v, n_c, and m. This factor is also polynomial in the input dimension.

Figure 14 presents the learning curves for deep CNNs tailored to the structure of the model and trained in the lazy regime on the maximal case, i.e., n_c = v and m = v^{s-1}. In particular, we consider the infinite-width limit of CNNs with all layers scaled by a factor H^{-1/2}, including the last. In this limit, CNNs become equivalent to a kernel method [49], with an architecture-dependent kernel known as the neural tangent kernel. In our experiments, we use the analytical form of this kernel (see, e.g., Ref. [25]) and train a kernel logistic regression classifier up to convergence. Our main result is that, in the lazy regime, the generalization error stays finite even when P is close to P_max; thus, kernels suffer from the curse of dimensionality.

Notice that the learning curves of the lazy regime follow those of the feature learning regime for P well below the sample complexity. This is because the CNN kernel can also exploit local correlations between the label and input patches [25] to improve its performance. However, unlike in the feature regime, kernels cannot build a hierarchical representation, and thus their test error does not converge to zero.
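The paper's lazy-regime experiments use the analytical CNN kernel; the sketch below instead builds an empirical neural tangent kernel for a small fully connected network and trains a kernel ridge classifier on top of it, as a rough stand-in for kernel logistic regression. All sizes, the synthetic data, and the ridge value are illustrative.

```python
import torch

torch.manual_seed(0)
d, n_train, n_test, width = 16, 256, 128, 256
X = torch.randn(n_train + n_test, d)
y = torch.sign(X[:, 0])                              # placeholder binary labels

net = torch.nn.Sequential(torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1))
params = list(net.parameters())

def grad_vector(x):
    """Gradient of the network output with respect to all parameters at input x."""
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

G = torch.stack([grad_vector(x) for x in X])         # per-sample parameter gradients
K = G @ G.T                                          # empirical NTK Gram matrix

# kernel ridge classifier built on the NTK (small ridge for numerical stability)
K_tr = K[:n_train, :n_train] + 1e-3 * torch.eye(n_train)
alpha = torch.linalg.solve(K_tr, y[:n_train])
pred = torch.sign(K[n_train:, :n_train] @ alpha)
print("test accuracy:", (pred == y[n_train:]).float().mean().item())
```

Because the kernel is fixed at initialization, no hierarchical representation is learned, which is the mechanism behind the saturation of the lazy-regime curves in Fig. 14.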


FIG. 12. Sample complexity of deep CNNs, for L = s = 2, v = 256, m = 23 and different values of n_c. Left: test error versus number of training points, with the color indicating the number of classes (see key). Right: sample complexity P (crosses) and law P = n_c m^L (black dashed).

FIG. 13. Sample complexity of deep fully connected networks with different depth, for s = 2 and m = n_c = v. Left: test error versus number of training points. The color denotes the value of m = n_c = v, the marker the hierarchy depth L of the RHM. Solid lines represent networks having depth L, while dashed lines correspond to networks with depth 6 > L. Note that, in all cases, the behavior of the test error is roughly independent of the network depth. Right: sample complexity P (crosses and circles). With respect to the case of deep CNNs tailored to the structure of the RHM, the sample complexity of generic deep networks seems to display an additional factor of s^L independently of n_c, m, and v.

FIG. 14. Learning curves of depth-(L + 1) CNNs, for L = 2, s = 2 and m = n_c = v, trained in the "lazy" regime (full lines), where they are equivalent to a kernel method [49], and in the "feature" learning regime (dashed lines). Different colors correspond to different vocabulary sizes v. Vertical lines signal P_max = v^{s^L}.

FIG. 15. Test error versus number of training points for a ResNet18 trained on subsamples of the CIFAR10 dataset. Results are the average of 10 jointly different initializations of the networks and dataset sampling.
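A rough sketch of the experiment behind Fig. 15 follows. This is an assumed setup, not the paper's training script: the optimizer, schedule, augmentation, number of epochs, and subsample sizes are not specified in the text and are chosen arbitrarily here.

```python
import torch
from torch.utils.data import DataLoader, Subset
from torchvision import datasets, transforms, models

device = "cuda" if torch.cuda.is_available() else "cpu"

def test_error(P, epochs=20):
    """Train a ResNet18 on a random subsample of size P of CIFAR10 and return the test error."""
    tfm = transforms.ToTensor()
    train = datasets.CIFAR10("data", train=True, download=True, transform=tfm)
    test = datasets.CIFAR10("data", train=False, download=True, transform=tfm)
    idx = torch.randperm(len(train))[:P].tolist()
    train_loader = DataLoader(Subset(train, idx), batch_size=128, shuffle=True)
    test_loader = DataLoader(test, batch_size=256)

    model = models.resnet18(num_classes=10).to(device)
    opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9, weight_decay=5e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x.to(device)), y.to(device)).backward()
            opt.step()

    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x.to(device)).argmax(1) == y.to(device)).sum().item()
    return 1 - correct / len(test)

for P in (512, 2048, 8192, 32768):
    print(P, test_error(P))   # test error versus number of training points, as in Fig. 15
```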


[1] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, Deep learning for computer vision: A brief review, Comput. Intell. Neurosci. 2018, 1 (2018).
[2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., Mastering the game of Go without human knowledge, Nature (London) 550, 354 (2017).
[3] U. v. Luxburg and O. Bousquet, Distance-based classification with Lipschitz functions, J. Mach. Learn. Res. 5, 669 (2004).
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, New York, 2009), pp. 248–255, https://ieeexplore.ieee.org/document/5206848.
[5] P. Pope, C. Zhu, A. Abdelkader, M. Goldblum, and T. Goldstein, The intrinsic dimension of images and its impact on learning, in Proceedings of the International Conference on Learning Representations (ICLR, 2021), https://openreview.net/forum?id=XJk19XzGq2J.
[6] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature (London) 521, 436 (2015).
[7] D. C. Van Essen and J. H. Maunsell, Hierarchical organization and functional streams in the visual cortex, Trends Neurosci. 6, 370 (1983).
[8] K. Grill-Spector and R. Malach, The human visual cortex, Annu. Rev. Neurosci. 27, 649 (2004).
[9] M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, in Proceedings of the 13th European Conference on Computer Vision (ECCV 2014), Zurich (Springer, Cham, 2014), pp. 818–833, 10.1007/978-3-319-10590-1_53.
[10] D. Doimo, A. Glielmo, A. Ansuini, and A. Laio, Hierarchical nucleation in deep neural networks, Adv. Neural Inf. Process. Syst. 33, 7526 (2020), https://proceedings.neurips.cc/paper/2020/hash/54f3bc04830d762a3b56a789b6ff62df-Abstract.html.
[11] J. Bruna and S. Mallat, Invariant scattering convolution networks, IEEE Trans. Pattern Anal. Mach. Intell. 35, 1872 (2013).
[12] R. Shwartz-Ziv and N. Tishby, Opening the black box of deep neural networks via information, arXiv:1703.00810.
[13] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, On the information bottleneck theory of deep learning, J. Stat. Mech. (2019) 124020.
[14] A. Ansuini, A. Laio, J. H. Macke, and D. Zoccolan, Intrinsic dimension of data representations in deep neural networks, Adv. Neural Inf. Process. Syst. 32, 6111 (2019).
[15] S. Recanatesi, M. Farrell, M. Advani, T. Moore, G. Lajoie, and E. Shea-Brown, Dimensionality compression and expansion in deep neural networks, arXiv:1906.00443.
[16] L. Petrini, A. Favero, M. Geiger, and M. Wyart, Relative stability toward diffeomorphisms indicates performance in deep nets, Adv. Neural Inf. Process. Syst. 34, 8727 (2021).
[17] U. M. Tomasini, L. Petrini, F. Cagnetta, and M. Wyart, How deep convolutional neural networks lose spatial information with training, Mach. Learn. Sci. Technol. 4, 045026 (2023).
[18] A. B. Patel, T. Nguyen, and R. G. Baraniuk, A probabilistic theory of deep learning, arXiv:1504.00641.
[19] E. Mossel, Deep learning and hierarchal generative models, arXiv:1612.09057.
[20] H. Mhaskar, Q. Liao, and T. Poggio, When and why are deep networks better than shallow ones?, in Proceedings of the AAAI Conference on Artificial Intelligence, 2017 (AAAI Press, San Francisco, 2017), Vol. 31, 10.1609/aaai.v31i1.10913.
[21] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao, Why and when can deep—but not shallow—networks avoid the curse of dimensionality: A review, Int. J. Autom. Comput. 14, 503 (2017).
[22] E. Malach and S. Shalev-Shwartz, A provably correct algorithm for deep learning that actually works, arXiv:1803.09522.
[23] J. Zazo, B. Tolooshams, D. Ba, and H. J. A. Paulson, Convolutional dictionary learning in hierarchical networks, in Proceedings of the IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2019 (IEEE, New York, 2019), Vol. 131, 10.1109/CAMSAP45676.2019.9022440.
[24] J. Schmidt-Hieber, Nonparametric regression using deep neural networks with ReLU activation function, Ann. Stat. 48, 1875 (2020).
[25] F. Cagnetta, A. Favero, and M. Wyart, What can be learnt with wide convolutional neural networks?, in Proceedings of the International Conference on Machine Learning, PMLR, 2023 (PMLR, 2023), pp. 3347–3379, https://proceedings.mlr.press/v202/cagnetta23a.html.
[26] U. Grenander, Elements of Pattern Theory (Johns Hopkins University Press, Baltimore and London, 1996), pp. xiii + 222, 10.1002/bimj.4710390707.
[27] M. Mézard, Mean-field message-passing equations in the Hopfield model and its generalizations, Phys. Rev. E 95, 022117 (2017).
[28] S. Goldt, M. Mézard, F. Krzakala, and L. Zdeborová, Modeling the influence of data structure on learning in neural networks: The hidden manifold model, Phys. Rev. X 10, 041044 (2020).


[29] A. M. Saxe, J. L. McClelland, and S. Ganguli, A mathematical theory of semantic development in deep neural networks, Proc. Natl. Acad. Sci. U.S.A. 116, 11537 (2019).
[30] Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli, Statistical mechanics of deep learning, Annu. Rev. Condens. Matter Phys. 11, 501 (2020).
[31] A. Ingrosso and S. Goldt, Data-driven emergence of convolutional structure in neural networks, Proc. Natl. Acad. Sci. U.S.A. 119, e2201854119 (2022).
[32] E. DeGiuli, Random language model, Phys. Rev. Lett. 122, 128301 (2019).
[33] F. Bach, The quest for adaptivity, Machine Learning Research Blog (2021), https://francisbach.com/quest-for-adaptivity/.
[34] L. Györfi, M. Kohler, A. Krzyzak, H. Walk et al., A Distribution-Free Theory of Nonparametric Regression, Vol. 1 (Springer, New York, 2002), 10.1007/b97848.
[35] S. Kpotufe, k-NN regression adapts to local intrinsic dimension, Adv. Neural Inf. Process. Syst. 24, 729 (2011), https://proceedings.neurips.cc/paper_files/paper/2011/file/05f971b5ec196b8c65b75d2ef8267331-Paper.pdf.
[36] T. Hamm and I. Steinwart, Adaptive learning rates for support vector machines working on data with low intrinsic dimension, Ann. Stat. 49, 3153 (2021).
[37] M. Geiger, L. Petrini, and M. Wyart, Landscape and training regimes in deep learning, Phys. Rep. 924, 1 (2021).
[38] J. Paccolat, L. Petrini, M. Geiger, K. Tyloo, and M. Wyart, Geometric compression of invariant manifolds in neural networks, J. Stat. Mech. (2021) 044001.
[39] E. Abbe, E. Boix-Adsera, M. S. Brennan, G. Bresler, and D. Nagaraj, The staircase property: How hierarchical structure can guide deep learning, in Advances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Curran Associates, Inc., 2021), pp. 26989–27002, https://proceedings.neurips.cc/paper_files/paper/2021/file/e2db7186375992e729165726762cb4c1-Paper.pdf.
[40] B. Barak, B. Edelman, S. Goel, S. Kakade, E. Malach, and C. Zhang, Hidden progress in deep learning: SGD learns parities near the computational limit, Adv. Neural Inf. Process. Syst. 35, 21750 (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/884baf65392170763b27c914087bde01-Paper-Conference.pdf.
[41] Y. Dandi, F. Krzakala, B. Loureiro, L. Pesce, and L. Stephan, Learning two-layer neural networks, one (giant) step at a time, arXiv:2305.18270.
[42] F. Bach, Breaking the curse of dimensionality with convex neural networks, J. Mach. Learn. Res. 18, 629 (2017), http://jmlr.org/papers/v18/14-546.html.
[43] E. Gardner and B. Derrida, Three unfinished works on the optimal storage capacity of networks, J. Phys. A 22, 1983 (1989).
[44] L. Zdeborová and F. Krzakala, Statistical physics of inference: Thresholds and algorithms, Adv. Phys. 65, 453 (2016).
[45] M. Mézard, Spin glass theory and its new challenge: Structured disorder, Indian J. Phys., 1 (2023).
[46] S. Spigler, M. Geiger, and M. Wyart, Asymptotic learning curves of kernel methods: Empirical data versus teacher–student paradigm, J. Stat. Mech. (2020) 124001.
[47] A. Favero, F. Cagnetta, and M. Wyart, Locality defeats the curse of dimensionality in convolutional teacher-student scenarios, Adv. Neural Inf. Process. Syst. 34, 9456 (2021), https://proceedings.neurips.cc/paper_files/paper/2021/file/4e8eaf897c638d519710b1691121f8cb-Paper.pdf.
[48] R. Aiudi, R. Pacelli, A. Vezzani, R. Burioni, and P. Rotondo, Local kernel renormalization as a mechanism for feature learning in overparametrized convolutional neural networks, arXiv:2307.11807.
[49] A. Jacot, F. Gabriel, and C. Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Adv. Neural Inf. Process. Syst. 31, 8580 (2018), https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
[50] L. Chizat, E. Oyallon, and F. Bach, On lazy training in differentiable programming, in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2019), pp. 2937–2947, https://proceedings.neurips.cc/paper_files/paper/2019/file/ae614c557843b1df326cb29c57225459-Paper.pdf.
[51] G. Rozenberg and A. Salomaa, Handbook of Formal Languages (Springer Berlin, Heidelberg, 1997), 10.1007/978-3-642-59136-5.
[52] G. Yang and E. J. Hu, Feature learning in infinite-width neural networks, arXiv:2011.14522.
[53] Let us focus on the first s-dimensional patch of the input x_1, which can take mv distinct values: m for each of the v level-2 features. For a linear transformation, insensitivity is equivalent to the following set of constraints: for each level-2 feature mu, and x_{1,i} encoding for one of the m level-1 representations generated by mu, w · x_{1,i} = c_mu. Since c_mu is an arbitrary constant, there are v × (m − 1) constraints for the v × s components of w, which cannot be satisfied in general unless m ≤ (s + 1).
[54] The notation x_j = mu means that the elements of the patch x_j encode the tuple of features mu.
[55] A. Damian, J. Lee, and M. Soltanolkotabi, Neural networks can learn representations with gradient descent, in Proceedings of Thirty-Fifth Conference on Learning Theory, 2022 (PMLR, 2022), Vol. 178, p. 5413, https://proceedings.mlr.press/v178/damian22a.html.
[56] J. Ba, M. A. Erdogdu, T. Suzuki, Z. Wang, D. Wu, and G. Yang, High-dimensional asymptotics of feature learning: How one gradient step improves the representation, Adv. Neural Inf. Process. Syst. 35, 37932 (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/f7e7fabd73b3df96c54a320862afcb78-Paper-Conference.pdf.
[57] Here invariance to exchange of level-1 synonyms can already be achieved at the first hidden layer due to the orthogonalization of the s-dimensional patches of the input, which makes them linearly separable.
[58] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas, Predicting parameters in deep learning, in Advances in Neural Information Processing Systems, Vol. 26 (2013), https://proceedings.neurips.cc/paper_files/paper/2013/file/7fec306d1e665bc9c748b5d2b99a6e97-Paper.pdf.


[59] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, Exploiting linear structure within convolutional networks for efficient evaluation, in Advances in Neural Information Processing Systems, Vol. 27 (Curran Associates, Inc., 2014), https://proceedings.neurips.cc/paper_files/paper/2014/file/2afe4567e1bf64d32a5527244d104cea-Paper.pdf.
[60] X. Yu, T. Liu, X. Wang, and D. Tao, On compressing deep models by low rank and sparse decomposition, in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, New York, 2017), Vol. 67, https://openaccess.thecvf.com/content_cvpr_2017/html/Yu_On_Compressing_Deep_CVPR_2017_paper.html.
[61] F. Guth, B. Ménard, G. Rochette, and S. Mallat, A rainbow in deep network black boxes, arXiv:2305.18512.
[62] E. Malach and S. Shalev-Shwartz, The implications of local correlation on learning some deep functions, Adv. Neural Inf. Process. Syst. 33, 1322 (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/0e4ceef65add6cf21c0f3f9da53b71c0-Paper.pdf.
[63] S. Shalev-Shwartz, O. Shamir, and S. Shammah, Failures of gradient-based deep learning, in Proceedings of the International Conference on Machine Learning (PMLR, 2017), pp. 3067–3075, https://proceedings.mlr.press/v70/shalev-shwartz17a.html.
[64] N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodriguez-Sanchez, and L. Wiskott, Deep hierarchies in the primate visual cortex: What can we learn for computer vision?, IEEE Trans. Pattern Anal. Mach. Intell. 35, 1847 (2012).
[65] A. Krizhevsky, Learning multiple layers of features from tiny images (2009), https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[66] L. Petrini and F. Cagnetta, Random hierarchy model (2023), 10.5281/zenodo.12509435; https://github.com/pcsl-epfl/hierarchy-learning/blob/master/datasets/hierarchical.py.
[67] https://github.com/pcsl-epfl/hierarchy-learning/blob/master/models.
[68] https://github.com/pcsl-epfl/hierarchy-learning/blob/master.
[69] G. Yang, Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation, arXiv:1902.04760.
[70] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., PyTorch: An imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Syst. 32, 8026 (2019), https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.
