How Deep Neural Networks Learn Compositional Data: The Random Hierarchy Model
Francesco Cagnetta,1,*,† Leonardo Petrini,1,* Umberto M. Tomasini,1 Alessandro Favero,1,2 and Matthieu Wyart1,‡
1Institute of Physics, EPFL, Lausanne, Switzerland
2Institute of Electrical Engineering, EPFL, Lausanne, Switzerland
(Received 19 December 2023; revised 15 April 2024; accepted 31 May 2024; published 1 July 2024)
Deep learning algorithms demonstrate a surprising ability to learn high-dimensional tasks from limited
examples. This is commonly attributed to the depth of neural networks, enabling them to build a hierarchy
of abstract, low-dimensional data representations. However, how many training examples are required to
learn such representations remains unknown. To quantitatively study this question, we introduce the
random hierarchy model: a family of synthetic tasks inspired by the hierarchical structure of language and
images. The model is a classification task where each class corresponds to a group of high-level features,
chosen among several equivalent groups associated with the same class. In turn, each feature corresponds to
a group of subfeatures chosen among several equivalent groups and so on, following a hierarchy of
composition rules. We find that deep networks learn the task by developing internal representations
invariant to exchanging equivalent groups. Moreover, the number of data required corresponds to the point
where correlations between low-level features and classes become detectable. Overall, our results indicate
how deep networks overcome the curse of dimensionality by building invariant representations and provide
an estimate of the number of data required to learn a hierarchical task.
DOI: 10.1103/PhysRevX.14.031001 Subject Areas: Complex Systems,
Interdisciplinary Physics,
Statistical Physics
complexity, i.e., the number of training data necessary for learning such representations, and thus the task.

In this paper, we study the relationship between sample complexity, depth of the learning method, and structure of the data by focusing on tasks with a hierarchically compositional structure—arguably a key property for the learnability of real data [18–25]. To provide a concrete example, consider a picture that consists of several high-level features like face, body, and background. Each feature is composed of subfeatures like ears, mouth, eyes, and nose for the face, which can be further thought of as combinations of low-level features such as edges [26]. Recent studies have revealed that deep networks can represent hierarchically compositional functions with far fewer parameters than shallow networks [21], implying an information-theoretic lower bound on the sample complexity which is only polynomial in the input dimension [24]. While these works offer important insights, they do not characterize the performance of deep neural networks trained with gradient descent (GD).

We investigate this question by adopting the physicist's approach [27–32] of introducing a model of synthetic data, which is inspired by the structure of natural problems, yet simple enough to be investigated systematically. This model (Sec. II) belongs to a family of hierarchical classification problems where the class labels generate the input data via a hierarchy of composition rules. These problems were introduced to highlight the importance of input-to-label correlations for learnability [19] and were found to be learnable via an iterative clustering algorithm [22]. Under the assumption of randomness of the composition rules, we show empirically that shallow networks suffer from the curse of dimensionality (Sec. III), whereas the sample complexity P* of deep networks (both convolutional networks and multilayer perceptrons) is only polynomial in the size of the input. More specifically, with n_c classes and L composition rules that associate m equivalent low-level representations to each class or high-level feature, P* ≃ n_c m^L asymptotically in m (Sec. III).

Furthermore, we find that P* coincides with both (a) the number of data that allows for learning a representation that is invariant to exchanging the m semantically equivalent low-level features (Sec. III A) and (b) the size of the training set for which the correlations between low-level features and class label become detectable (Sec. IV). We prove for a simplified architecture trained with gradient descent that (a) and (b) must indeed coincide. Via (b), P* can be derived analytically under our assumption of randomness of the composition rules.

A. Relationship to other models of data structure

Characterizing the properties that make high-dimensional data learnable is a classical problem in statistics. Typical assumptions that allow for avoiding the curse of dimensionality include (i) data lying on a low-dimensional manifold and (ii) the task being smooth [33]. For instance, in the context of regression, the sample complexity is not controlled by the bare input dimensionality d, but by the ratio d_M/s [34–36], where d_M is the dimension of the data manifold and s the number of bounded derivatives of the target function. However, d_M is also large in practice [5]; thus, keeping d_M/s low requires an unrealistically large number of bounded derivatives. Moreover, properties (i) and (ii) can already be leveraged by isotropic kernel methods, and thus cannot account for the significant advantage of deep learning methods in many benchmark datasets [37]. Alternatively, learnability can be achieved when (iii) the task depends on a small number of linear projections of the input variables, such as regression of a target function f(x) = g(x_t), where x ∈ R^d and x_t ∈ R^t [38–41]. Methods capable of learning features from the data can leverage this property to achieve a sample complexity that depends on t instead of d [42]. However, one-hidden-layer networks are sufficient for that; hence, this property does not explain the need for deep architectures.

In the context of statistical physics, the quest for a model of data structure has been pursued within the framework of teacher-student models [43–45], where a teacher uses some ground truth knowledge to generate data, while a student tries to infer the ground truth from the data. The structural properties (i)–(iii) can be incorporated into this approach [28,46]. In addition, using a shallow convolutional network as a teacher allows for modeling (iv) the locality of imagelike datasets [31,47,48]. In the context of regression, this property can be modeled with a function f(x) = Σ_i f_i(x_i), where the sum runs over all patches x_i of t adjacent pixels. Convolutional networks learn local tasks with a sample complexity controlled by the patch dimension t [47], even in the "lazy" regime [49,50] where they do not learn features. However, locality does not allow for long-range nonlinear dependencies in the task. It might be tempting to include these dependencies by considering a deep convolutional teacher network, but then the sample complexity would be exponential in the input dimension d [25].

The present analysis based on hierarchical generative models shows that properties (i)–(iii) are not necessary to beat the curse of dimensionality. Indeed, for some choices of the parameters, the model generates all possible d-dimensional sequences of input features, which violates (i). Additionally, changing a single input feature has a finite probability of changing the label, violating the smoothness assumption (ii). Finally, the label depends on all d input variables, violating (iii). Yet, we find that the sample complexity of deep neural networks is only polynomial in d. Since locality is incorporated hierarchically in the generative process, it generates long-range dependencies in the task, but it can still be leveraged by building a hierarchical representation of the data.
FIG. 1. The random hierarchy model. Left: structure of the generative model. The class label α = 1, …, n_c generates a set of m equivalent (i.e., synonymic) high-level representations with elements taken from a vocabulary of high-level features V_L. Similarly, high-level features generate m equivalent lower-level representations, taken from a vocabulary V_{L−1}. Repeating this procedure L − 2 times yields all the input data with label α, consisting of low-level features taken from V_1. Right: example of a random hierarchy model with n_c = 2 classes, L = 3, s = 2, m = 3, and homogeneous vocabulary size v_1 = v_2 = v_3 = 3. The three sets of rules are listed at the top, while two examples of data generation are shown at the bottom. The first example is obtained by following the rules in the colored boxes.
II. RANDOM HIERARCHY MODEL

In this section, we introduce our generative model, which can be thought of as an L-level context-free grammar—a generative model of language from formal language theory [51]. The model consists of a set of class labels C ≡ {1, …, n_c} and L disjoint vocabularies V_l ≡ {a^l_1, …, a^l_{v_l}} of low- and high-level features. As illustrated in Fig. 1, left-hand panel, data are generated from the class labels. Specifically, each label generates m distinct high-level representations via m composition rules of the form

α ↦ μ_1^{(L)}, …, μ_s^{(L)},  for α ∈ C and μ_i^{(L)} ∈ V_L,   (1)

having size s > 1. The s elements of these representations are high-level features μ_i^{(L)} such as background, face, and body for a picture. Each high-level feature generates in turn m lower-level representations via other m rules,

μ^{(l)} ↦ μ_1^{(l−1)}, …, μ_s^{(l−1)},  for μ^{(l)} ∈ V_l and μ_i^{(l−1)} ∈ V_{l−1},   (2)

from l = L down to l = 1. The input features μ^{(1)} represent low-level features such as the edges in an image. Because of the hierarchical structure of the generative process, each datum can be represented as a tree of branching factor s and depth L, where the root is the class label, the leaves are the input features, and the hidden nodes are the level-l features with l = 2, …, L.

In addition, for each level l, there are m distinct rules emanating from the same higher-level feature μ^{(l)}; i.e., there are m equivalent lower-level representations of μ^{(l)} (see Fig. 1, right-hand panel, for an example with m = 3). Following the analogy with language, we refer to these equivalent representations as synonyms. We assume that, for a given low-level representation, there is only one high-level feature that generates it, i.e., that there are no ambiguities. Since the number of distinct s-tuples at level l is bounded by v_l^s, this assumption requires m v_{l+1} ≤ v_l^s for all l = 1, …, L (with v_{L+1} ≡ n_c). If m = 1, each label generates only a single datum and the model is trivial. For m > 1, the number of data per class grows exponentially with the input dimension d = s^L:

m × m^s × ⋯ × m^{s^{L−1}} = m^{Σ_{i=0}^{L−1} s^i} = m^{(d−1)/(s−1)}.   (3)

In particular, in the case where m v_{l+1} = v_l^s, the model generates all the possible data made of d features in V_1. Instead, for m v_{l+1} < v_l^s, the set of available input data is given by the application of the composition rules; therefore, it inherits the hierarchical structure of the model.

Let us remark that, due to the nonambiguity assumption, each set of composition rules can be summarized with a function g_l that associates s-tuples of level-l features to the corresponding level-(l+1) feature. The domain of g_l is a subset of V_l^s consisting of the m v_{l+1} s-tuples generated by the features at level (l+1). Using these functions, the label α ≡ μ^{(L+1)} of an input datum μ^{(1)} = (μ_1^{(1)}, …, μ_d^{(1)}) can be written as a hierarchical composition of L local functions of s variables [20,21]:

μ_i^{(l+1)} = g_l( μ_{(i−1)s+1}^{(l)}, …, μ_{is}^{(l)} ),   (4)

for i = 1, …, s^{L−l} and l = 1, …, L.
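The generative process of Eqs. (1)–(4) is easy to reproduce numerically; the reference implementation is the one cited in [66]. The following is only a minimal sketch of that process (our own simplification, assuming homogeneous vocabularies of size v; all function names and conventions here are ours, not those of the released code).

import numpy as np

def sample_rules(v, n_c, m, s, L, rng):
    # For each level, assign to every higher-level symbol m distinct s-tuples of
    # lower-level symbols; all tuples are distinct across symbols (no ambiguity).
    rules = []
    for l in range(L):
        n_high = n_c if l == 0 else v                    # the top level expands the n_c class labels
        idx = rng.choice(v**s, size=m * n_high, replace=False)   # requires m * n_high <= v**s
        tuples = np.stack(np.unravel_index(idx, (v,) * s), axis=-1)
        rules.append(tuples.reshape(n_high, m, s))       # rules[l][symbol][k] -> one s-tuple
    return rules

def sample_datum(label, rules, m, rng):
    # Expand a class label down to the d = s**L input features.
    symbols = np.array([label])
    for level_rules in rules:                            # from the top level down to the input
        choices = rng.integers(m, size=len(symbols))     # pick one synonym per symbol
        symbols = level_rules[symbols, choices].reshape(-1)
    return symbols                                       # sequence of s**L level-1 features

rng = np.random.default_rng(0)
v, n_c, m, s, L = 8, 8, 8, 2, 3
rules = sample_rules(v, n_c, m, s, L, rng)
x = sample_datum(label=3, rules=rules, m=m, rng=rng)     # one datum of class 3, d = 8 features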
FIG. 6. Synonymic sensitivities S_{k,l} of the layers of a depth-(L+1) CNN trained on an RHM with L = 3, s = 2, n_c = m = v = 8, as a function of the training set size P. The colors denote the level of the exchanged synonyms (as in the key), whereas different panels correspond to the sensitivity of the activations of different layers (layer index in the gray box). Synonymic invariance is learned at the same training set size for all layers, and invariance to level-l exchanges is obtained from layer k = l + 1.

f_j(α|μ) := Pr{x ∈ α | x_j = μ} = N_j(μ, α) / N_j(μ).   (9)

If the low-level features are homogeneously spread across classes, then f = n_c^{−1} for all α, μ, and j. In contrast, due to the aforementioned correlations, the probabilities of the RHM are all different from n_c^{−1}—we refer to this difference as signal. Distinct level-1 tuples μ and ν yield a different f (and thus a different signal) with high probability unless μ and ν are synonyms, i.e., they share the same level-2
representation. Therefore, this signal can be used to identify synonymous level-1 tuples.

B. Signal versus sampling noise

When measuring the conditional class probabilities with only P training data, the occurrences in the right-hand side of Eq. (9) are replaced with empirical occurrences, which induce a sampling noise on the f's. For the identification of synonyms to be possible, this noise must be smaller in magnitude than the aforementioned signal—a visual representation of the comparison between signal and noise is depicted in Fig. 7.

FIG. 7. Signal versus noise illustration. The dashed function represents the distribution of f(α|μ) resulting from the random sampling of the RHM rules. The solid dots illustrate the true frequencies f(α|μ) sampled from this distribution.

The magnitude of the signal can be computed as the ratio between the standard deviation and mean of f_j(α|μ) over realizations of the RHM. The full calculation is presented in Appendix C; here we present a simplified argument based on an additional independence assumption. Given a class α, the tuple μ appearing in the jth input patch is determined by a sequence of L choices—one choice per level of the hierarchy—of one among m possible lower-level representations. These m^L possibilities lead to all the mv distinct input s-tuples. N_j(μ, α) is proportional to how often the tuple μ is chosen—m^L/(mv) times on average. Under the assumption of independence of the m^L choices, the fluctuations of N_j(μ, α) relative to its mean are given by the central limit theorem and read [m^L/(mv)]^{−1/2} in the limit of large m. If n_c is sufficiently large, the fluctuations of N_j(μ) are negligible in comparison. Therefore, the relative fluctuations of f_j are the same as those of N_j(μ, α), and the size of the signal is [m^L/(mv)]^{−1/2}.

The magnitude of the noise is given by the ratio between the standard deviation and mean, over independent samplings of a training set of fixed size P, of the empirical conditional probabilities f̂_j(α|μ). Only P/(n_c mv) of the training points will, on average, belong to class α while displaying feature μ in the jth patch. Therefore, by the convergence of the empirical measure to the true probability, the sampling fluctuations of f̂ relative to the mean are of order [P/(n_c mv)]^{−1/2}—see Appendix C for a detailed derivation. Balancing signal and noise yields the characteristic P_c for the emergence of correlations. For large m, n_c, and P,

P_c = n_c m^L,   (10)

which coincides with the empirical sample complexity of deep networks discussed in Sec. III.

C. Learning level-1 synonyms with one step of gradient descent

To complete the argument, we consider a simplified one-step gradient descent setting [55,56], where P_c marks the number of training examples required to learn a synonymic invariant representation. In particular, we focus on the s-dimensional patches of the data and study how a two-layer network acting on one of such patches learns the first composition rule of the RHM by building a representation invariant to exchanges of level-1 synonyms.

Let us then sample an instance of the RHM and P input-label pairs (x_{k,1}, α_k), with α_k := α(x_k) for all k = 1, …, P and x_{k,1} denoting the first s-dimensional patch of the datum x_k. The network output reads

F_NN(x_1) = (1/H) Σ_{h=1}^{H} a_h σ(w_h · x_1),   (11)

where the inner-layer weights w_h have the same dimension as x_1, the top-layer weights a_h are n_c-dimensional, and σ(x) = max(0, x) is the rectified linear unit (ReLU) activation function. To further simplify the problem, we represent x_1 as a v^s-dimensional one-hot encoding of the corresponding s-tuple of features. This representation is equivalent to an orthogonalization of the input points. In addition, the top-layer weights are initialized as i.i.d. Gaussian with zero mean and unit variance and fixed, whereas the w_h are initialized with all their elements set to 1 and trained by gradient descent on the empirical cross-entropy loss:

L = (1/P) Σ_{k=1}^{P} −log( exp([F_NN(x_{k,1})]_{α(x_k)}) / Σ_{β=1}^{n_c} exp([F_NN(x_{k,1})]_β) ).   (12)
This component, which is set to 1 at initialization, is updated by (minus) the corresponding component of the gradient of the loss in Eq. (12). Recalling also that the predictor is 0 at initialization, we get

Δf_h(x_1) = −∇_{(w_h)_{μ(x_1)}} L
 = (1/P) Σ_{k=1}^{P} Σ_{α=1}^{n_c} a_{h,α} δ_{μ(x_1), μ(x_{k,1})} ( δ_{α, α(x_k)} − 1/n_c )
 = Σ_{α=1}^{n_c} a_{h,α} ( N̂_1(μ(x_1), α)/P − (1/n_c) N̂_1(μ)/P ),   (13)

where N̂_1(μ) is the empirical occurrence of the s-tuple μ in the first patch of the P training points and N̂_1(μ, α) is the (empirical) joint occurrence of the s-tuple μ and the class label α. As P increases, the empirical occurrences N̂ converge to the true occurrences N, which are invariant for the exchange of synonym s-tuples μ. Hence, the hidden representation is also invariant for the exchange of synonym s-tuples in this limit.

This prediction is confirmed empirically in Fig. 8, which shows the sensitivity S_{1,1} of the hidden representation [57] of shallow fully connected networks trained in the setting of this section, as a function of the number P of training data for different combinations of the model parameters. The bottom panel, in particular, highlights that the sensitivity is close to 1 for P ≪ P_c and close to 0 for P ≫ P_c. In addition, notice that the collapse of the preactivations of synonymic tuples onto the same, synonymic invariant value implies that the rank of the hidden weights matrix tends to v—the vocabulary size of higher-level features. This low-rank structure is typical in the weights of deep networks trained on image classification [58–61].

(a) Including all patches via weight sharing. Let us remark that one can easily extend the one-step setting to include the information from all the input patches, for instance, by replacing the network in Eq. (11) with a one-hidden-layer convolutional network with filter size s and nonoverlapping patches. Consequently, the empirical occurrences on the right-hand side of Eq. (13) would be replaced with average occurrences over the patches. However, this average results in a reduction of both the signal and the sampling noise contributions to the empirical occurrences by the same factor √(s^{L−1}). Therefore, weight sharing does not affect the sample size required for synonymic invariance in the one-step setting.

(b) Improved sample complexity via clustering. A distance-based clustering method acting on the representations of Eq. (13) can actually identify synonyms at P ≃ √(n_c) m^L = P_c/√(n_c), which is much smaller than P_c in the large-n_c limit. Intuitively, using a sequence instead of a scalar amplifies the signal by a factor n_c and the sampling noise by a factor √(n_c), improving the signal-to-noise ratio. We show that this is indeed the case in Appendix D for the maximal dataset case n_c = v and m = v^{s−1}. Previous theoretical studies have considered the possibility of intercalating clustering steps in standard gradient descent methods [22,62], but the question of whether deep learning methods can achieve a similar sample complexity with standard end-to-end training remains open.
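For illustration, the update of Eq. (13) can be evaluated directly from empirical occurrences, without running gradient descent. The sketch below is our own rendering (reusing the toy-generator conventions of the snippet after Sec. II; it is not the authors' code): rows of the returned matrix corresponding to synonymous s-tuples collapse onto one another once P grows past P_c.

import numpy as np

def one_step_representation(first_patches, labels, n_c, v, s, H, rng):
    # first_patches: integer array of shape (P, s); labels: integer array of shape (P,).
    # Returns Delta f_h(mu) of Eq. (13) for every possible s-tuple mu.
    P = len(labels)
    tuple_ids = np.ravel_multi_index(first_patches.T, (v,) * s)  # encode each s-tuple as one index
    N_mu_alpha = np.zeros((v**s, n_c))
    np.add.at(N_mu_alpha, (tuple_ids, labels), 1.0)              # empirical joint occurrences
    N_mu = N_mu_alpha.sum(axis=1, keepdims=True)                 # empirical occurrences of mu
    bracket = N_mu_alpha / P - N_mu / (n_c * P)                  # term in parentheses in Eq. (13)
    a = rng.standard_normal((H, n_c))                            # frozen Gaussian readout weights
    return bracket @ a.T                                         # Delta f_h(mu), shape (v**s, H)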
independently of the position i and the specific low- and high-level features. Note that, since m ≤ v^{s−1} with s fixed, large m implies also large v.

2. Joint statistics of a single rule

(a) Shared high-level feature. For a fixed high-level feature μ_2, the joint probability of the occurrences of two different low-level features μ_1 and ν_1 is a multivariate hypergeometric distribution,

Pr{ N_i(μ_1, μ_2) = k, N_i(ν_1, μ_2) = l } = C(v^{s−1}, k) C(v^{s−1}, l) C(v^s − 2v^{s−1}, m − k − l) / C(v^s, m),   (B4)

where C(n, k) denotes a binomial coefficient, giving the following covariance:

c_N := ⟨( N_i(μ_1, μ_2) − ⟨N⟩ )( N_i(ν_1, μ_2) − ⟨N⟩ )⟩ = −(m/v²) (v^s − m)/(v^s − 1) →_{m≫1} −(m/v)² (1/m).   (B5)

The covariance can also be obtained via the constraint Σ_{μ_1} N_i(μ_1, μ_2) = m. For any finite sequence of identically distributed random variables X_μ with a constraint on the sum Σ_μ X_μ = m:

Σ_{μ=1}^{v} X_μ = m ⇒ Σ_{μ=1}^{v} ( X_μ − ⟨X_μ⟩ ) = 0
⇒ ( X_ν − ⟨X_ν⟩ ) Σ_{μ=1}^{v} ( X_μ − ⟨X_μ⟩ ) = 0
⇒ Σ_{μ=1}^{v} ⟨( X_ν − ⟨X_ν⟩ )( X_μ − ⟨X_μ⟩ )⟩ = 0 ⇒ Var[X_ν] + (v − 1) Cov[X_ν, X_{μ≠ν}] = 0.   (B6)

(b) Shared low-level feature. For a fixed low-level feature μ_1 and two different high-level features μ_2 and ν_2,

Pr{ N(μ_1, μ_2) = k, N(μ_1, ν_2) = l } = Pr{ N(μ_1, μ_2) = k | N(μ_1, ν_2) = l } × Pr{ N(μ_1, ν_2) = l } = H_{v^s − m, v^{s−1} − l, m}(k) × H_{v^s, v^{s−1}, m}(l),   (B8)

resulting in the following "interfeature" covariance:

c_if := Cov[ N_i(μ_1, μ_2), N_i(μ_1, ν_2) ] = −(m/v)² (v − 1)/(v^s − 1).   (B9)

(c) No shared features. Finally, by multiplying both sides of Σ_{μ_1} N(μ_1, μ_2) = m with N(ν_1, ν_2) and averaging, we get

c_g := Cov[ N_i(μ_1, μ_2), N_i(ν_1, ν_2) ] = −Cov[ N_i(μ_1, μ_2), N_i(μ_1, ν_2) ]/(v − 1) = (m/v)² / (v^s − 1).   (B10)

APPENDIX C: EMERGENCE OF INPUT-OUTPUT CORRELATIONS (P_c)

As discussed in the main text, the random hierarchy model presents a characteristic sample size P_c corresponding to the emergence of the input-output correlations. This sample size predicts the sample complexity of deep CNNs, as we also discuss in the main text. In this appendix, we prove that

P_c →_{n_c, m → ∞} n_c m^L.   (C1)
031001-11
FRANCESCO CAGNETTA et al. PHYS. REV. X 14, 031001 (2024)
f_j(α|μ) := Pr{x ∈ α | x_j = μ},   (C2)

where the notation x_j = μ means that the elements of the patch x_j encode the tuple of features μ. We say that the low-level features are correlated with the output if

f_j(α|μ) ≠ 1/n_c.   (C3)

The number of occurrences of feature μ_1 in position i_1…i_L of the inputs belonging to class α is

N^{(1→L)}_{i_{1→L}}(μ_1, α) = m^{(s^L−1)/(s−1)−L} Σ_{μ_2,…,μ_L=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, α).   (C6)
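The building blocks of Eq. (C6) are the single-rule occurrences N_i(μ_1, μ_2), whose mean is m/v and whose covariances are given in Appendix B. They can be verified with a short Monte Carlo; the sketch below is our own sanity check (not part of the paper's code) and samples one random rule per trial.

import numpy as np

v, s, m, trials = 8, 2, 8, 200_000
rng = np.random.default_rng(1)
counts = np.zeros((trials, v))
for t in range(trials):
    chosen = rng.choice(v**s, size=m, replace=False)      # the m s-tuples of one rule
    first = np.unravel_index(chosen, (v,) * s)[0]         # low-level feature in position i = 1
    counts[t] = np.bincount(first, minlength=v)           # N_1(mu_1, mu_2) for every mu_1

print(counts.mean())                                      # ~ m / v
print(counts.var())                                       # ~ (m/v)(1 - 1/v)(v**s - m)/(v**s - 1), cf. Eq. (C12)
print(np.mean(counts[:, 0] * counts[:, 1]) - counts.mean() ** 2)   # ~ c_N of Eq. (B5), negative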
Let us also introduce, for convenience, the numerator and denominator of the right-hand side of Eq. (C8):

U^{(1→L)}_{i_{1→L}}(μ_1, α) = Σ_{μ_2,…,μ_L=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, α),
D^{(1→L)}_{i_{1→L}}(μ_1) = Σ_{α=1}^{n_c} U^{(1→L)}_{i_{1→L}}(μ_1, α).   (C9)

Consider first the statistics of the numerator,

U^{(1→L)}_{i_{1→L}}(μ_1, α) = Σ_{μ_2,…,μ_L=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, α).   (C10)

(a) Level 1: L = 1. For L = 1, U is simply the occurrence of a single production rule N_i(μ_1, α),

⟨U^{(1)}⟩ = m/v,   (C11)

σ²_{U^{(1)}} := Var[U^{(1)}] = (m/v) ((v − 1)/v) ((v^s − m)/(v^s − 1)) →_{v≫1} m/v,   (C12)

c_{U^{(1)}} := Cov[ U^{(1)}(μ_1, α), U^{(1)}(ν_1, α) ] = −Var[U^{(1)}]/(v − 1) = −(m/v) ((v^s − m)/(v^s − 1)) (1/v) →_{v≫1} −(m/v)² (1/m),   (C13)

where the relationship between variance and covariance is due to the constraint on the sum of U^{(1)} over μ_1; see Eq. (B6).

(b) Level 2: L = 2. For L = 2,

U^{(1→2)}_{i_{1→2}}(μ_1, α) = Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) × N^{(2)}_{i_2}(μ_2, α) = Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) U^{(2)}_{i_2}(μ_2, α).   (C14)

Therefore,

⟨U^{(1→2)}⟩ = v (m/v) × ⟨U^{(1)}⟩ = v (m/v)²,   (C15)

σ²_{U^{(2)}} := Var[U^{(1→2)}]
 = Σ_{μ_2,ν_2=1}^{v} ⟨ N^{(1)}(μ_1, μ_2) N^{(1)}(μ_1, ν_2) U^{(2)}(μ_2, α) U^{(2)}(ν_2, α) ⟩ − v²⟨N⟩²⟨U^{(1)}⟩²
 = Σ_{μ_2=ν_2}(…) + Σ_{μ_2} Σ_{ν_2≠μ_2}(…)
 = v ( σ²_N σ²_{U^{(1)}} + σ²_N ⟨U^{(1)}⟩² + σ²_{U^{(1)}} ⟨N⟩² ) + v(v − 1) ( c_if c_{U^{(1)}} + c_if ⟨U^{(1)}⟩² + c_{U^{(1)}} ⟨N⟩² ),   (C16)

c_{U^{(2)}} = −σ²_{U^{(2)}} / (v − 1).   (C17)
(c) Generic level L. For generic L,

U^{(1→L)}_{i_{1→L}}(μ_1, α) = Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) U^{(2→L)}_{i_{2→L}}(μ_2, α).   (C18)

Therefore,

⟨U^{(L)}⟩ = v (m/v) × ⟨U^{(L−1)}⟩ = v^{L−1} (m/v)^L,   (C19)

σ²_{U^{(L)}} = Σ_{μ_2,ν_2=1}^{v} ⟨ N^{(1)}(μ_1, μ_2) N^{(1)}(μ_1, ν_2) U^{(2→L)}(μ_2, α) U^{(2→L)}(ν_2, α) ⟩ − v²⟨N⟩²⟨U^{(1→(L−1))}⟩²
 = Σ_{μ_2=ν_2}(…) + Σ_{μ_2} Σ_{ν_2≠μ_2}(…)
 = v ( σ²_N σ²_{U^{(L−1)}} + σ²_N ⟨U^{(L−1)}⟩² + σ²_{U^{(L−1)}} ⟨N⟩² ) + v(v − 1) ( c_if c_{U^{(L−1)}} + c_if ⟨U^{(L−1)}⟩² + c_{U^{(L−1)}} ⟨N⟩² )
 = v σ²_{U^{(L−1)}} (σ²_N − c_if) + v ⟨U^{(L−1)}⟩² (σ²_N + (v − 1) c_if),   (C20)

c_{U^{(L)}} = −σ²_{U^{(L)}} / (v − 1).   (C21)

(d) Concentration for large m. In the large multiplicity limit m ≫ 1, the U's concentrate around their mean value. Because of m ≤ v^{s−1}, large m implies large v; thus, we can proceed by setting m = q v^{s−1}, with q ∈ (0, 1], and studying the v ≫ 1 limit. From Eq. (C19),

⟨U^{(L)}⟩ = m^L/v = q^L v^{L(s−1)−1}.   (C22)

In addition,

σ²_N →_{v≫1} m/v = q v^{(s−1)−1},   c_if →_{v≫1} −(m/v)² (1/v^{s−1}) = −q² v^{(s−1)−2},   (C23)

so that

σ²_{U^{(L)}} →_{v≫1} v σ²_N σ²_{U^{(L−1)}} − v c_if σ²_{U^{(L−1)}} + v (σ²_N + (v − 1) c_if) ⟨U^{(L−1)}⟩².   (C24)

The second of the three terms is always subleading with respect to the first, so we can discard it for now. It remains to compare the first and the third terms. For L = 2, since σ²_{U^{(1)}} = σ²_N, the first term depends on v as v^{2(s−1)−1}, whereas the third is proportional to v^{3(s−1)−2}. For L ≥ 3, the dominant scaling is that of the third term only: for L = 3 it can be shown by simply plugging the L = 2 result into the recursion, and for larger L it follows from the fact that replacing σ²_{U^{(L−1)}} in the first term with the third term of the previous step always yields a subdominant contribution. Therefore,

σ²_{U^{(L)}} →_{v≫1} q² v^{2(s−1)−1} + q³(1 − q) v^{3(s−1)−2}  for L = 2,
σ²_{U^{(L)}} →_{v≫1} q^{2L−1}(1 − q) v^{(2L−1)(s−1)−2}  for L ≥ 3.   (C25)
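The recursion (C20) and the asymptotic scaling (C25) can be cross-checked numerically. The sketch below is our own sanity check (not from the paper's code); it iterates the recursion with the exact single-rule statistics and compares with the predicted power of v for L ≥ 3.

def u_statistics(v, m, s, L):
    mean_N = m / v
    var_N = (m / v) * ((v - 1) / v) * (v**s - m) / (v**s - 1)       # Eq. (C12)
    c_if = -((m / v) ** 2) * (v - 1) / (v**s - 1)                   # Eq. (B9)
    mean_U, var_U = mean_N, var_N                                   # level 1
    for _ in range(2, L + 1):
        var_U = v * var_U * (var_N - c_if) + v * mean_U**2 * (var_N + (v - 1) * c_if)  # Eq. (C20)
        mean_U = v * mean_N * mean_U                                # Eq. (C19)
    return mean_U, var_U

s, L = 2, 3
for v in (16, 64, 256, 1024):
    m = v // 2                                                      # q = m / v**(s-1) = 1/2
    q = m / v ** (s - 1)
    _, var_U = u_statistics(v, m, s, L)
    predicted = q ** (2 * L - 1) * (1 - q) * v ** ((2 * L - 1) * (s - 1) - 2)  # Eq. (C25), L >= 3
    print(v, var_U / predicted)                                     # ratio tends to 1 as v grows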
The relative variance σ²_{U^{(L)}}/⟨U^{(L)}⟩² therefore vanishes in this limit, which guarantees the concentration of the U's around the average over all instances of the RHM.

We now turn to the denominator,

D^{(1→L)}_{i_{1→L}}(μ_1) = Σ_{μ_2,…,μ_L=1}^{v} Σ_{μ_{L+1}=1}^{n_c} N^{(1)}_{i_1}(μ_1, μ_2) × ⋯ × N^{(L)}_{i_L}(μ_L, μ_{L+1}).   (C27)

(a) Level 1: L = 1. For L = 1, D is simply the sum over classes of the occurrences of a single production rule, D^{(1)} = Σ_α N_i(μ_1, α),

⟨D^{(1)}⟩ = n_c m/v,   (C28)

σ²_{D^{(1)}} := Var[D^{(1)}] = n_c σ²_N + n_c(n_c − 1) c_if = n_c (m/v) ((v − 1)/v) ((v^s − n_c m)/(v^s − 1))
 →_{v≫1} n_c m/v − (m/v)² n_c²/v^{s−1},   (C29)

c_{D^{(1)}} := Cov[ D^{(1)}(μ_1), D^{(1)}(ν_1) ] = −Var[D^{(1)}]/(v − 1) = n_c c_N + n_c(n_c − 1) c_g,   (C30)

where, in the last line, we used the identities σ²_N + (v − 1) c_N = 0 from Eq. (B5) and c_if + (v − 1) c_g = 0 from Eq. (B10).

(b) Level 2: L = 2. For L = 2,

D^{(1→2)}_{i_{1→2}}(μ_1) = Σ_{μ_2=1}^{v} Σ_{μ_3=1}^{n_c} N^{(1)}_{i_1}(μ_1, μ_2) × N^{(2)}_{i_2}(μ_2, μ_3) = Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) D^{(2)}_{i_2}(μ_2).   (C31)

Therefore,

⟨D^{(1→2)}⟩ = v (m/v) × ⟨D^{(1)}⟩ = (n_c/v) m²,   (C32)

σ²_{D^{(2)}} := Var[D^{(1→2)}]
 = Σ_{μ_2,ν_2=1}^{v} ⟨ N^{(1)}(μ_1, μ_2) N^{(1)}(μ_1, ν_2) D^{(2)}(μ_2) D^{(2)}(ν_2) ⟩ − v²⟨N⟩²⟨D^{(1)}⟩²
 = Σ_{μ_2=ν_2}(…) + Σ_{μ_2} Σ_{ν_2≠μ_2}(…)
 = v ( σ²_N σ²_{D^{(1)}} + σ²_N ⟨D^{(1)}⟩² + σ²_{D^{(1)}} ⟨N⟩² ) + v(v − 1) ( c_if c_{D^{(1)}} + c_if ⟨D^{(1)}⟩² + c_{D^{(1)}} ⟨N⟩² ),   (C33)

c_{D^{(2)}} = −σ²_{D^{(2)}} / (v − 1).   (C34)
(c) Generic level L. For generic L,

D^{(1→L)}_{i_{1→L}}(μ_1) = Σ_{μ_2=1}^{v} N^{(1)}_{i_1}(μ_1, μ_2) D^{(2→L)}_{i_{2→L}}(μ_2).   (C35)

Therefore,

⟨D^{(L)}⟩ = v (m/v) × ⟨D^{(L−1)}⟩ = (n_c/v) m^L,   (C36)

σ²_{D^{(L)}} = Σ_{μ_2,ν_2=1}^{v} ⟨ N^{(1)}(μ_1, μ_2) N^{(1)}(μ_1, ν_2) D^{(2→L)}(μ_2) D^{(2→L)}(ν_2) ⟩ − v²⟨N⟩²⟨D^{(1→(L−1))}⟩²
 = Σ_{μ_2=ν_2}(…) + Σ_{μ_2} Σ_{ν_2≠μ_2}(…)
 = v ( σ²_N σ²_{D^{(L−1)}} + σ²_N ⟨D^{(L−1)}⟩² + σ²_{D^{(L−1)}} ⟨N⟩² ) + v(v − 1) ( c_if c_{D^{(L−1)}} + c_if ⟨D^{(L−1)}⟩² + c_{D^{(L−1)}} ⟨N⟩² ),   (C37)

c_{D^{(L)}} = −σ²_{D^{(L)}} / (v − 1).   (C38)

(d) Concentration for large m. Since the D's can be expressed as a sum of different U's, their concentration for m ≫ 1 follows directly from that of the U's.

Having shown that both numerator and denominator converge to their average for large m, we can expand for small fluctuations around these averages and write

f^{(1→L)}_{i_{1→L}}(α|μ_1) = [ (m^L/v) ( 1 + (U^{(1→L)}_{i_{1→L}}(μ_1, α) − m^L/v)/(m^L/v) ) ] / [ (n_c m^L/v) ( 1 + (D^{(1→L)}_{i_{1→L}}(μ_1) − n_c m^L/v)/(n_c m^L/v) ) ]   (C40)
 = 1/n_c + (1/n_c) (U^{(1→L)}_{i_{1→L}}(μ_1, α) − m^L/v)/(m^L/v) − (1/n_c) (D^{(1→L)}_{i_{1→L}}(μ_1) − n_c m^L/v)/(n_c m^L/v)
 = 1/n_c + (v/(n_c m^L)) [ U^{(1→L)}_{i_{1→L}}(μ_1, α) − (1/n_c) D^{(1→L)}_{i_{1→L}}(μ_1) ].   (C41)

Since the conditional frequencies average to n_c^{−1}, the term in brackets averages to zero. We can then estimate the size of the fluctuations of the conditional frequencies (i.e., the signal) with the standard deviation of the term in brackets.

It is important to notice that, for each L and position i_{1→L}, D is the sum over α of U, and the U with different α at fixed low-level feature μ_1 are identically distributed. In general, for a sequence of identically distributed variables (X_α)_{α=1,…,n_c},

⟨( (1/n_c) Σ_{β=1}^{n_c} X_β )²⟩ = (1/n_c²) Σ_{β=1}^{n_c} ( ⟨X_β²⟩ + Σ_{β'≠β} ⟨X_β X_{β'}⟩ ) = (1/n_c) ( ⟨X_β²⟩ + Σ_{β'≠β} ⟨X_β X_{β'}⟩ ).   (C42)
Hence,

⟨( X_α − (1/n_c) Σ_{β=1}^{n_c} X_β )²⟩ = ⟨X_α²⟩ + n_c^{−2} Σ_{β,γ=1}^{n_c} ⟨X_β X_γ⟩ − 2 n_c^{−1} Σ_{β=1}^{n_c} ⟨X_α X_β⟩
 = ⟨X_α²⟩ − n_c^{−1} ( ⟨X_α²⟩ + Σ_{β≠α} ⟨X_α X_β⟩ )
 = ⟨X_α²⟩ − n_c^{−2} ⟨( Σ_{β=1}^{n_c} X_β )²⟩.   (C43)

In our case,

⟨( U^{(1→L)}_{i_{1→L}}(μ_1, α) − (1/n_c) D^{(1→L)}_{i_{1→L}}(μ_1) )²⟩ = ⟨( U^{(1→L)}_{i_{1→L}}(μ_1, α) )²⟩ − n_c^{−2} ⟨( D^{(1→L)}_{i_{1→L}}(μ_1) )²⟩ = σ²_{U^{(L)}} − n_c^{−2} σ²_{D^{(L)}},   (C44)

where, in the second line, we have used that ⟨U^{(L)}⟩ = ⟨D^{(L)}⟩/n_c to convert the difference of second moments into a difference of variances. By Eqs. (C19) and (C36), the mean-squared terms of the recursions (C20) and (C37) cancel in the difference, so that

σ²_{U^{(L)}} − n_c^{−2} σ²_{D^{(L)}} = v (σ²_N − c_if) ( σ²_{U^{(L−1)}} − n_c^{−2} σ²_{D^{(L−1)}} ).   (C45)

Since

σ²_{U^{(1)}} = (m/v) ((v − 1)/v) ((v^s − m)/(v^s − 1)) →_{v≫1} m/v,   (C46)

n_c^{−2} σ²_{D^{(1)}} = n_c^{−1} σ²_N + n_c^{−1}(n_c − 1) c_if →_{v≫1} (1/n_c) (m/v) (1 − n_c m/v^s),   (C47)

one has

σ²_{U^{(L)}} − n_c^{−2} σ²_{D^{(L)}} →_{v≫1} (m^L/v) ( 1 − (1 − n_c m/v^s)/n_c ),   (C48)

so that

Var[ f^{(1→L)}_{i_{1→L}}(α|μ_1) ] = v² ⟨( U^{(1→L)}_{i_{1→L}}(μ_1, α) − (1/n_c) D^{(1→L)}_{i_{1→L}}(μ_1) )²⟩ / (n_c² m^{2L}) →_{v, n_c ≫ 1} (v/n_c) (1/(n_c m^L)).   (C49)
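As a concrete order of magnitude (a worked example added here for illustration, with the parameters of Fig. 6, n_c = m = v = 8 and L = 3):

σ_f = √(v/m^L) = √(8/8³) = 1/8,   P_c = n_c m^L = 8 × 8³ = 4096,

i.e., the conditional frequencies fluctuate around 1/n_c = 0.125 with a relative spread of about 12%, and this signal exceeds the sampling noise √(v n_c/P) once P exceeds roughly 4 × 10³ training points [cf. Eqs. (C54) and (C55) below].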
N(μ, α) the number of favorable cases. If P is large and P_max, N(μ, α) are both much larger than P, then

N̂(μ, α) → 𝒩( P N(μ, α)/P_max, P (N(μ, α)/P_max)(1 − N(μ, α)/P_max) ),   (C50)

where the convergence is meant as a convergence in probability and 𝒩(a, b) denotes a Gaussian distribution with mean a and variance b. The statement above holds when the ratio N(μ, α)/P_max is away from 0 and 1, which is true with probability 1 for large v due to the concentration of f(α|μ). In complete analogy, the empirical occurrence N̂(μ) obeys

N̂(μ) → 𝒩( P N(μ)/P_max, P (N(μ)/P_max)(1 − N(μ)/P_max) ).   (C51)

We obtain the empirical conditional frequency by the ratio of Eqs. (C50) and (C51). Since N(μ) = P_max/v and f(α|μ) = N(μ, α)/N(μ), we have

f̂(α|μ) = [ f(α|μ)/v + ξ_P √( (1/P)(f(α|μ)/v)(1 − f(α|μ)/v) ) ] / [ 1/v + ζ_P √( (1/P)(1/v)(1 − 1/v) ) ],   (C52)

where ξ_P and ζ_P are correlated zero-mean and unit-variance Gaussian random variables over independent drawings of the P training points. By expanding the denominator of the right-hand side for large P we get, after some algebra,

f̂(α|μ) ≃ f(α|μ) + ξ_P √( (v f(α|μ)/P)(1 − f(α|μ)/v) ) − ζ_P f(α|μ) √( (v/P)(1 − 1/v) ).   (C53)

Recall that, in the limit of large n_c and m, f(α|μ) = n_c^{−1}(1 + σ_f ξ_RHM), where ξ_RHM is a zero-mean and unit-variance Gaussian variable over the realizations of the RHM, while σ_f is the signal, σ²_f = v/m^L by Eq. (C49). As a result,

f̂(α|μ) →_{n_c, m, P ≫ 1} (1/n_c) ( 1 + √(v/m^L) ξ_RHM + √(v n_c/P) ξ_P ).   (C54)

3. Sample complexity

From Eq. (C54) it is clear that, for the signal to be detectable in f̂, the fluctuations due to noise must be smaller than those due to the random choice of the composition rules. Therefore, the crossover takes place when the two noise terms have the same size, occurring at P = P_c such that

v/m^L = v n_c/P_c  ⇒  P_c = n_c m^L.   (C55)

APPENDIX D: IMPROVED SAMPLE COMPLEXITY VIA CLUSTERING

In this appendix, we consider the maximal dataset case n_c = v and m = v^{s−1}, and show that a distance-based clustering method acting on the hidden representations of Eq. (13) would identify synonyms at P ≃ √(n_c) m^L. Let us then imagine feeding the representation updates Δf_h(μ) of Eq. (13) to a clustering algorithm aimed at identifying synonyms. This algorithm is based on the distance between the representations of different tuples of input features μ and ν,

‖Δf(μ) − Δf(ν)‖² := (1/H) Σ_{h=1}^{H} ( Δf_h(μ) − Δf_h(ν) )²,   (D1)

where H is the number of hidden neurons. By defining

ĝ_α(μ) := N̂_1(μ, α)/P − (1/n_c) N̂_1(μ)/P,   (D2)

and denoting with ĝ(μ) the n_c-dimensional sequence having the ĝ_α's as components, we have

‖Δf(μ) − Δf(ν)‖² = Σ_{α,β=1}^{n_c} (1/H) Σ_h a_{h,α} a_{h,β} ( ĝ_α(μ) − ĝ_α(ν) )( ĝ_β(μ) − ĝ_β(ν) )
 →_{H→∞} Σ_{α=1}^{n_c} ( ĝ_α(μ) − ĝ_α(ν) )² = ‖ĝ(μ) − ĝ(ν)‖²,   (D3)

where we used the i.i.d. Gaussian initialization of the readout weights to replace the sum over neurons with δ_{α,β}. Because of the sampling noise, from Eqs. (C50) and (C51), when 1 ≪ P ≪ P_max,

ĝ_α(μ) = g_α(μ) + √( 1/(n_c m v P) ) η_α(μ),   (D4)
where η_α(μ) is a zero-mean and unit-variance Gaussian noise and g without hat denotes the P → P_max limit of ĝ. In the limit 1 ≪ P ≪ P_max, the noises with different α and μ are independent of each other. Thus,

‖ĝ(μ) − ĝ(ν)‖² = ‖g(μ) − g(ν)‖² + (1/(n_c m v P)) ‖η(μ) − η(ν)‖² + (2/√(n_c m v P)) (g(μ) − g(ν)) · (η(μ) − η(ν)).   (D5)

If μ and ν are synonyms, then g(μ) = g(ν) and only the noise term contributes to the right-hand side of Eq. (D5). If this noise is sufficiently small, then the distance above can be used to cluster tuples into synonymic groups.

By the independence of the noises and the central limit theorem, for n_c ≫ 1,

‖η(μ) − η(ν)‖² ∼ 𝒩( 2 n_c, O(√n_c) ),   (D6)

over independent samplings of the P training points. The g's are also random variables over independent realizations of the RHM with zero mean and variance proportional to the variance of the conditional probabilities f(α|μ) [see Eqs. (C40) and (C49)]. These fluctuations are independent across classes and of different tuples within the same class, so that the size of the term ‖g(μ) − g(ν)‖² for n_c = v ≫ 1 can be estimated via the central limit theorem:

‖g(μ) − g(ν)‖² ∼ 𝒩( 2n_c/(n_c m v P_c), O(√n_c)/(n_c m v P_c) ).   (D8)

The mixed term (g(μ) − g(ν)) · (η(μ) − η(ν)) has zero average (both with respect to training set sampling and RHM realizations) and can also be shown to lead to relative fluctuations of order O(√n_c) in the maximal dataset case. To summarize, we have that, for synonyms,

‖ĝ(μ) − ĝ(ν)‖² = (1/(n_c m v P)) ‖η(μ) − η(ν)‖² ∼ (1/(m v P)) ( 1 + (1/√n_c) ξ_P ),   (D9)

where ξ_P is some O(1) noise dependent on the training set sampling. If μ and ν are not synonyms, instead,

‖ĝ(μ) − ĝ(ν)‖² ∼ (1/(m v P)) ( 1 + (1/√n_c) ξ_P ) + (1/(m v P_c)) ( 1 + (1/√n_c) ξ_RHM ),   (D10)
FIG. 10. Sample complexity for layerwise training, m = n_c = v, L = 3, s = 2. Training of an L-layer network is performed layerwise by alternating one-step GD as described in Sec. IV C and clustering of the hidden representations. Clustering of the mv = v² representations for the different one-hot-encoded input patches is performed with the k-means algorithm. Clustered representations are then orthogonalized and the result is given to the next one-step GD procedure. Left: test error versus number of training points. Different colors correspond to different values of v. Center: collapse of the test error curves when rescaling the x axis by v^{L+1/2}. Right: analogous, when rescaling the x axis by v^{L+1}. The curves show a better collapse when rescaling by v^{L+1/2}, suggesting that these layerwise algorithms have an advantage of a factor √v over end-to-end training with deep CNNs, for which P* = v^{L+1}.
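For concreteness, the clustering step of this layerwise protocol can be sketched as follows (our own illustrative reading of Fig. 10, reusing the toy-data conventions of the earlier snippets; scikit-learn's k-means stands in for whichever clustering routine was actually used).

import numpy as np
from sklearn.cluster import KMeans

def cluster_synonyms(first_patches, labels, n_c, v, s, n_clusters):
    # Group the observed s-tuples into synonym classes from P training points.
    P = len(labels)
    tuple_ids = np.ravel_multi_index(first_patches.T, (v,) * s)
    N_mu_alpha = np.zeros((v**s, n_c))
    np.add.at(N_mu_alpha, (tuple_ids, labels), 1.0)
    g_hat = N_mu_alpha / P - N_mu_alpha.sum(1, keepdims=True) / (n_c * P)   # Eq. (D2)
    seen = np.unique(tuple_ids)                                   # tuples that actually occur
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(g_hat[seen])
    return dict(zip(seen.tolist(), km.labels_.tolist()))          # tuple id -> inferred level-2 feature

Setting n_clusters = v, the recovered groups should match the true synonym classes once P ≳ √(n_c) m^L, per Eqs. (D9) and (D10).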
algorithm which alternates single GD steps to clustering of the resulting representations [22,62]. More specifically, the weights of the first hidden layer are updated with a single GD step while keeping all the other weights frozen. The resulting representations are then clustered, so as to identify groups of synonymic level-1 tuples. The centroids of the ensuing clusters, which correspond to level-2 features, are orthogonalized and used as inputs of another one-step GD protocol, which aims at identifying synonymic tuples of level-2 features. The procedure is iterated L times.

depth, even though the actual dimensionality of the manifold did not change.

In the specific case of our data, the intrinsic dimensionality of the internal representations of deep CNNs monotonically decreases with depth (see Fig. 11), consistently with the idea proposed in the main text that the CNNs solve the problem by reducing the effective dimensionality of data layer by layer. We attribute this monotonicity to the absence of spurious or noisy directions that might lead to the counterintuitive effect described above.
FIG. 11. Effective dimension of the internal representation of a CNN trained on one instance of the RHM with m = n_c = v, L = 3, resulting in P_max = 6232. Left: average nearest-neighbor distance of input or network activations when probing them with a dataset of size P. The value reported on the y axis is normalized by δ_0 = δ(P = 10). The slope of δ(P) is used as an estimate of the effective dimension. Right: effective dimension as a function of depth. We observe a monotonic decrease, consistent with the idea that the dimensionality of the problem is reduced by deep neural networks with depth.
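The estimate described in the caption can be reproduced with a few lines; the sketch below is our own reading of that recipe (δ(P) ∝ P^(−1/d_eff), so d_eff is minus the inverse slope of log δ versus log P), with assumed function names.

import numpy as np
from scipy.spatial import cKDTree

def effective_dimension(activations, sizes=(10, 30, 100, 300, 1000), seed=0):
    # Estimate d_eff from the scaling of the average nearest-neighbor distance.
    rng = np.random.default_rng(seed)
    deltas = []
    for P in sizes:
        sample = activations[rng.choice(len(activations), size=P, replace=False)]
        dist, _ = cKDTree(sample).query(sample, k=2)      # k=2: skip the zero self-distance
        deltas.append(dist[:, 1].mean())                  # delta(P)
    slope = np.polyfit(np.log(sizes), np.log(deltas), 1)[0]
    return -1.0 / slope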
FIG. 12. Sample complexity of deep CNNs, for L = s = 2, v = 256, m = 23, and different values of n_c. Left: test error versus number of training points, with the color indicating the number of classes (see key). Right: sample complexity P* (crosses) and the law P* = n_c m^L (black dashed).
FIG. 13. Sample complexity of deep fully connected networks with different depths, for s = 2 and m = n_c = v. Left: test error versus number of training points. The color denotes the value of m = n_c = v, the marker the depth L of the RHM hierarchy. Solid lines represent networks having depth L, while dashed lines correspond to networks with depth 6 > L. Note that, in all cases, the behavior of the test error is roughly independent of the network depth. Right: sample complexity P* (crosses and circles). With respect to the case of deep CNNs tailored to the structure of the RHM, the sample complexity of generic deep networks seems to display an additional factor of s^L, independently of n_c, m, and v.
analytical form of this kernel (see, e.g., Ref. [25]) and train a kernel logistic regression classifier up to convergence. Our main result is that, in the lazy regime, the generalization error stays finite even when P ≈ P_max; thus, kernels suffer from the curse of dimensionality.

Notice that the learning curves of the lazy regime follow those of the feature learning regime for P ≪ P*. This is because the CNN kernel can also exploit local correlations between the label and input patches [25] to improve its performance. However, unlike in the feature regime, kernels cannot build a hierarchical representation, and thus their test error does not converge to zero.

[1] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, Deep learning for computer vision: A brief review, Comput. Intell. Neurosci. 2018, 1 (2018).
[2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., Mastering the game of Go without human knowledge, Nature (London) 550, 354 (2017).
[3] U. v. Luxburg and O. Bousquet, Distance-based classification with Lipschitz functions, J. Mach. Learn. Res. 5, 669 (2004).
[4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition (IEEE, New York, 2009), pp. 248–255, https://ieeexplore.ieee.org/document/5206848.
[5] P. Pope, C. Zhu, A. Abdelkader, M. Goldblum, and T. Goldstein, The intrinsic dimension of images and its impact on learning, in Proceedings of the International Conference on Learning Representations (ICLR, 2021), https://openreview.net/forum?id=XJk19XzGq2J.
[6] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature (London) 521, 436 (2015).
[7] D. C. Van Essen and J. H. Maunsell, Hierarchical organization and functional streams in the visual cortex, Trends Neurosci. 6, 370 (1983).
[8] K. Grill-Spector and R. Malach, The human visual cortex, Annu. Rev. Neurosci. 27, 649 (2004).
[9] M. D. Zeiler and R. Fergus, Visualizing and understanding convolutional networks, in Proceedings of the 13th European Conference on Computer Vision—ECCV 2014, Zurich (Springer, Cham, 2014), pp. 818–833, 10.1007/978-3-319-10590-1_53.
[10] D. Doimo, A. Glielmo, A. Ansuini, and A. Laio, Hierarchical nucleation in deep neural networks, Adv. Neural Inf. Process. Syst. 33, 7526 (2020), https://proceedings.neurips.cc/paper/2020/hash/54f3bc04830d762a3b56a789b6ff62df-Abstract.html.
[11] J. Bruna and S. Mallat, Invariant scattering convolution networks, IEEE Trans. Pattern Anal. Mach. Intell. 35, 1872 (2013).
[12] R. Shwartz-Ziv and N. Tishby, Opening the black box of deep neural networks via information, arXiv:1703.00810.
[13] A. M. Saxe, Y. Bansal, J. Dapello, M. Advani, A. Kolchinsky, B. D. Tracey, and D. D. Cox, On the information bottleneck theory of deep learning, J. Stat. Mech. (2019) 124020.
[14] A. Ansuini, A. Laio, J. H. Macke, and D. Zoccolan, Intrinsic dimension of data representations in deep neural networks, Adv. Neural Inf. Process. Syst. 32, 6111 (2019).
[15] S. Recanatesi, M. Farrell, M. Advani, T. Moore, G. Lajoie, and E. Shea-Brown, Dimensionality compression and expansion in deep neural networks, arXiv:1906.00443.
[16] L. Petrini, A. Favero, M. Geiger, and M. Wyart, Relative stability toward diffeomorphisms indicates performance in deep nets, Adv. Neural Inf. Process. Syst. 34, 8727 (2021).
[17] U. M. Tomasini, L. Petrini, F. Cagnetta, and M. Wyart, How deep convolutional neural networks lose spatial information with training, Mach. Learn. Sci. Technol. 4, 045026 (2023).
[18] A. B. Patel, T. Nguyen, and R. G. Baraniuk, A probabilistic theory of deep learning, arXiv:1504.00641.
[19] E. Mossel, Deep learning and hierarchal generative models, arXiv:1612.09057.
[20] H. Mhaskar, Q. Liao, and T. Poggio, When and why are deep networks better than shallow ones?, in Proceedings of the AAAI Conference on Artificial Intelligence, 2017 (AAAI Press, San Francisco, 2017), Vol. 31, 10.1609/aaai.v31i1.10913.
[21] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao, Why and when can deep—but not shallow—networks avoid the curse of dimensionality: A review, Int. J. Autom. Comput. 14, 503 (2017).
[22] E. Malach and S. Shalev-Shwartz, A provably correct algorithm for deep learning that actually works, arXiv:1803.09522.
[23] J. Zazo, B. Tolooshams, D. Ba, and H. J. A. Paulson, Convolutional dictionary learning in hierarchical networks, in Proceedings of the IEEE 8th International Workshop on Computational Advances in Multi-Sensor Adaptive Processing, 2019 (IEEE, New York, 2019), Vol. 131, 10.1109/CAMSAP45676.2019.9022440.
[24] J. Schmidt-Hieber, Nonparametric regression using deep neural networks with ReLU activation function, Ann. Stat. 48, 1875 (2020).
[25] F. Cagnetta, A. Favero, and M. Wyart, What can be learnt with wide convolutional neural networks?, in Proceedings of the International Conference on Machine Learning, PMLR, 2023 (PMLR, 2023), pp. 3347–3379, https://proceedings.mlr.press/v202/cagnetta23a.html.
[26] U. Grenander, Elements of Pattern Theory (Johns Hopkins University Press, Baltimore and London, 1996), pp. xiii + 222, 10.1002/bimj.4710390707.
[27] M. Mézard, Mean-field message-passing equations in the Hopfield model and its generalizations, Phys. Rev. E 95, 022117 (2017).
[28] S. Goldt, M. Mézard, F. Krzakala, and L. Zdeborová, Modeling the influence of data structure on learning in neural networks: The hidden manifold model, Phys. Rev. X 10, 041044 (2020).
[29] A. M. Saxe, J. L. McClelland, and S. Ganguli, A mathematical theory of semantic development in deep neural networks, Proc. Natl. Acad. Sci. U.S.A. 116, 11537 (2019).
[30] Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli, Statistical mechanics of deep learning, Annu. Rev. Condens. Matter Phys. 11, 501 (2020).
[31] A. Ingrosso and S. Goldt, Data-driven emergence of convolutional structure in neural networks, Proc. Natl. Acad. Sci. U.S.A. 119, e2201854119 (2022).
[32] E. DeGiuli, Random language model, Phys. Rev. Lett. 122, 128301 (2019).
[33] F. Bach, The quest for adaptivity, Machine Learning Research Blog (2021), https://francisbach.com/quest-for-adaptivity/.
[34] L. Györfi, M. Kohler, A. Krzyzak, H. Walk et al., A Distribution-Free Theory of Nonparametric Regression, Vol. 1 (Springer, New York, 2002), 10.1007/b97848.
[35] S. Kpotufe, k-NN regression adapts to local intrinsic dimension, Adv. Neural Inf. Process. Syst. 24, 729 (2011), https://proceedings.neurips.cc/paper_files/paper/2011/file/05f971b5ec196b8c65b75d2ef8267331-Paper.pdf.
[36] T. Hamm and I. Steinwart, Adaptive learning rates for support vector machines working on data with low intrinsic dimension, Ann. Stat. 49, 3153 (2021).
[37] M. Geiger, L. Petrini, and M. Wyart, Landscape and training regimes in deep learning, Phys. Rep. 924, 1 (2021).
[38] J. Paccolat, L. Petrini, M. Geiger, K. Tyloo, and M. Wyart, Geometric compression of invariant manifolds in neural networks, J. Stat. Mech. (2021) 044001.
[39] E. Abbe, E. Boix-Adsera, M. S. Brennan, G. Bresler, and D. Nagaraj, The staircase property: How hierarchical structure can guide deep learning, in Advances in Neural Information Processing Systems, Vol. 34, edited by M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Curran Associates, Inc., 2021), pp. 26989–27002, https://proceedings.neurips.cc/paper_files/paper/2021/file/e2db7186375992e729165726762cb4c1-Paper.pdf.
[40] B. Barak, B. Edelman, S. Goel, S. Kakade, E. Malach, and C. Zhang, Hidden progress in deep learning: SGD learns parities near the computational limit, Adv. Neural Inf. Process. Syst. 35, 21750 (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/884baf65392170763b27c914087bde01-Paper-Conference.pdf.
[41] Y. Dandi, F. Krzakala, B. Loureiro, L. Pesce, and L. Stephan, Learning two-layer neural networks, one (giant) step at a time, arXiv:2305.18270.
[42] F. Bach, Breaking the curse of dimensionality with convex neural networks, J. Mach. Learn. Res. 18, 629 (2017), http://jmlr.org/papers/v18/14-546.html.
[43] E. Gardner and B. Derrida, Three unfinished works on the optimal storage capacity of networks, J. Phys. A 22, 1983 (1989).
[44] L. Zdeborová and F. Krzakala, Statistical physics of inference: Thresholds and algorithms, Adv. Phys. 65, 453 (2016).
[45] M. Mézard, Spin glass theory and its new challenge: Structured disorder, Indian J. Phys., 1 (2023).
[46] S. Spigler, M. Geiger, and M. Wyart, Asymptotic learning curves of kernel methods: Empirical data versus teacher–student paradigm, J. Stat. Mech. (2020) 124001.
[47] A. Favero, F. Cagnetta, and M. Wyart, Locality defeats the curse of dimensionality in convolutional teacher-student scenarios, Adv. Neural Inf. Process. Syst. 34, 9456 (2021), https://proceedings.neurips.cc/paper_files/paper/2021/file/4e8eaf897c638d519710b1691121f8cb-Paper.pdf.
[48] R. Aiudi, R. Pacelli, A. Vezzani, R. Burioni, and P. Rotondo, Local kernel renormalization as a mechanism for feature learning in overparametrized convolutional neural networks, arXiv:2307.11807.
[49] A. Jacot, F. Gabriel, and C. Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Adv. Neural Inf. Process. Syst. 31, 8580 (2018), https://proceedings.neurips.cc/paper_files/paper/2018/file/5a4be1fa34e62bb8a6ec6b91d2462f5a-Paper.pdf.
[50] L. Chizat, E. Oyallon, and F. Bach, On lazy training in differentiable programming, in Advances in Neural Information Processing Systems (Curran Associates, Inc., 2019), pp. 2937–2947, https://proceedings.neurips.cc/paper_files/paper/2019/file/ae614c557843b1df326cb29c57225459-Paper.pdf.
[51] G. Rozenberg and A. Salomaa, Handbook of Formal Languages (Springer Berlin, Heidelberg, 1997), 10.1007/978-3-642-59136-5.
[52] G. Yang and E. J. Hu, Feature learning in infinite-width neural networks, arXiv:2011.14522.
[53] Let us focus on the first s-dimensional patch of the input x_1, which can take mv distinct values—m for each of the v level-2 features. For a linear transformation, insensitivity is equivalent to the following set of constraints: for each level-2 feature μ, and x_{1,i} encoding for one of the m level-1 representations generated by μ, w · x_{1,i} = c_μ. Since c_μ is an arbitrary constant, there are v × (m − 1) constraints for the v × s components of w, which cannot be satisfied in general unless m ≤ (s + 1).
[54] The notation x_j = μ means that the elements of the patch x_j encode the tuple of features μ.
[55] A. Damian, J. Lee, and M. Soltanolkotabi, Neural networks can learn representations with gradient descent, in Proceedings of Thirty-Fifth Conference on Learning Theory, 2022 (PMLR, 2022), Vol. 178, p. 5413, https://proceedings.mlr.press/v178/damian22a.html.
[56] J. Ba, M. A. Erdogdu, T. Suzuki, Z. Wang, D. Wu, and G. Yang, High-dimensional asymptotics of feature learning: How one gradient step improves the representation, Adv. Neural Inf. Process. Syst. 35, 37932 (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/f7e7fabd73b3df96c54a320862afcb78-Paper-Conference.pdf.
[57] Here invariance to exchange of level-1 synonyms can already be achieved at the first hidden layer due to the orthogonalization of the s-dimensional patches of the input, which makes them linearly separable.
[58] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas, Predicting parameters in deep learning, in Advances in Neural Information Processing Systems, Vol. 26 (2013), https://proceedings.neurips.cc/paper_files/paper/2013/file/7fec306d1e665bc9c748b5d2b99a6e97-Paper.pdf.
[59] E. L. Denton, W. Zaremba, J. Bruna, Y. LeCun, and R. Fergus, Exploiting linear structure within convolutional networks for efficient evaluation, in Advances in Neural Information Processing Systems, Vol. 27 (Curran Associates, Inc., 2014), https://proceedings.neurips.cc/paper_files/paper/2014/file/2afe4567e1bf64d32a5527244d104cea-Paper.pdf.
[60] X. Yu, T. Liu, X. Wang, and D. Tao, On compressing deep models by low rank and sparse decomposition, in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (IEEE, New York, 2017), Vol. 67, https://openaccess.thecvf.com/content_cvpr_2017/html/Yu_On_Compressing_Deep_CVPR_2017_paper.html.
[61] F. Guth, B. Ménard, G. Rochette, and S. Mallat, A rainbow in deep network black boxes, arXiv:2305.18512.
[62] E. Malach and S. Shalev-Shwartz, The implications of local correlation on learning some deep functions, Adv. Neural Inf. Process. Syst. 33, 1322 (2020), https://proceedings.neurips.cc/paper_files/paper/2020/file/0e4ceef65add6cf21c0f3f9da53b71c0-Paper.pdf.
[63] S. Shalev-Shwartz, O. Shamir, and S. Shammah, Failures of gradient-based deep learning, in Proceedings of the International Conference on Machine Learning (PMLR, 2017), pp. 3067–3075, https://proceedings.mlr.press/v70/shalev-shwartz17a.html.
[64] N. Kruger, P. Janssen, S. Kalkan, M. Lappe, A. Leonardis, J. Piater, A. J. Rodriguez-Sanchez, and L. Wiskott, Deep hierarchies in the primate visual cortex: What can we learn for computer vision?, IEEE Trans. Pattern Anal. Mach. Intell. 35, 1847 (2012).
[65] A. Krizhevsky, Learning multiple layers of features from tiny images (2009), https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
[66] L. Petrini and F. Cagnetta, Random hierarchy model (2023), 10.5281/zenodo.12509435; https://github.com/pcsl-epfl/hierarchy-learning/blob/master/datasets/hierarchical.py.
[67] https://github.com/pcsl-epfl/hierarchy-learning/blob/master/models.
[68] https://github.com/pcsl-epfl/hierarchy-learning/blob/master.
[69] G. Yang, Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation, arXiv:1902.04760.
[70] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., PyTorch: An imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Syst. 32, 8026 (2019), https://proceedings.neurips.cc/paper_files/paper/2019/file/bdbca288fee7f92f2bfa9f7012727740-Paper.pdf.