Fractal Patterns May Unravel the Intelligence in Next-Token Prediction

Ibrahim Alabdulmohsin 1 Vinh Q. Tran 2 Mostafa Dehghani 1

1 Google DeepMind   2 Google Research. Correspondence to: Ibrahim Alabdulmohsin <ibomohsin@google.com>. Copyright 2024 by the author(s).

arXiv:2402.01825v1 [cs.CL] 2 Feb 2024

Abstract

We study the fractal structure of language, aiming to provide a precise formalism for quantifying properties that may have been previously suspected but not formally shown. We establish that language is: (1) self-similar, exhibiting complexities at all levels of granularity, with no particular characteristic context length, and (2) long-range dependent (LRD), with a Hurst parameter of approximately H = 0.70 ± 0.09. Based on these findings, we argue that short-term patterns/dependencies in language, such as in paragraphs, mirror the patterns/dependencies over larger scopes, like entire documents. This may shed some light on how next-token prediction can lead to a comprehension of the structure of text at multiple levels of granularity, from words and clauses to broader contexts and intents. We also demonstrate that fractal parameters improve upon perplexity-based bits-per-byte (BPB) in predicting downstream performance. We hope these findings offer a fresh perspective on language and the mechanisms underlying the success of LLMs.

1. Introduction

How does next-token prediction in large language models (LLMs) yield remarkably intelligent behavior? Consider, for instance, two models: Gemini (Anil et al., 2023a) and GPT-4 (OpenAI, 2023). These models have demonstrated extraordinary capabilities beyond just mastering language. Their skills extend to quantitative reasoning, creative content creation, document summarization, and even coding, which has prompted some researchers to ponder if there is more to intelligence than "on-the-fly improvisation" (Bubeck et al., 2023). While understanding the exceptional capabilities of LLMs is complex, particularly given the fuzzy meaning of "intelligent" behavior, a possible insight can be drawn from the study of fractals and self-similarity. We elucidate this connection in this work.

Self-Similarity. Self-similar processes were introduced by Kolmogorov in 1940 (Kolmogorov, 1940). The notion garnered considerable attention during the late 1960s, thanks to the extensive works of Mandelbrot and his peers (Embrechts & Maejima, 2000). Broadly speaking, an object is called "self-similar" if it is invariant across scales, meaning its statistical or geometric properties stay consistent irrespective of the magnification applied to it (see Figure 1). Nature and geometry furnish us with many such patterns, such as coastlines, snowflakes, the Cantor set and the Koch curve. Despite the distinction, self-similarity is often discussed in the context of "fractals," another term popularized by Mandelbrot in his seminal book The Fractal Geometry of Nature (Mandelbrot, 1982). However, the two concepts are different (Gneiting & Schlather, 2004). See Section 2.

In language, in particular, there have been studies arguing for the presence of a self-similar structure. Nevertheless, due to the computational constraints of the past, it was not feasible to holistically model the joint probability distribution of language. As such, linguists often resorted to rudimentary approximations in their arguments, such as by substituting a word with its frequency or length (Ausloos, 2012), or by focusing on the recurrence of a specific, predetermined word (Najafi & Darooneh, 2015; Altmann et al., 2012). These studies fall short of fully capturing the underlying structure of language due to the simplifying assumptions they make, as discussed in Section 4.

Highlighting the self-similar nature of a process can have profound implications. For instance, conventional Poisson models for Ethernet traffic were shown to fail because traffic was self-similar (Crovella & Bestavros, 1995; Leland et al., 1994; Paxson & Floyd, 1995; Willinger et al., 1997). In such cases, recognizing and quantifying this self-similarity had practical applications, such as in the design of buffers in network devices (Leland & Wilson, 1991). Similarly in language, we argue that self-similarity may offer a fresh perspective on the mechanisms underlying the success of LLMs. Consider the illustrative example shown in Figure 1, where the task is to predict the subsequent measurement in a time series, specifically predicting next tokens in a Wikipedia article (see Section 2 for details). The three plots in Figure 1 (top) represent different manifestations of the same process observed across three distinct time scales. Notably, we observe rich details, e.g. burstiness, in all of them.

Figure 1. Manifestations of processes across different time scales. A region marked in red corresponds to the magnified plot above it. TOP: The process exhibits self-similarity with rich details at all levels of granularity. It is an integral process (Xt)t∈N calculated from Wikipedia (see Section 2). BOTTOM: Example of a process that is not self-similar, looking smoother at larger time scales.

Hence, for the model to successfully predict the next measurement, it must capture the behavior of the process at various levels of granularity. The standard approach for quantifying self-similarity is the Hölder exponent (Watkins, 2019), which we denote by S. In language, we estimate it to be S = 0.59 ± 0.08, confirming statistical self-similarity.

Why is this significant? We hypothesize that since LLMs are trained to predict the future of a self-similar process, they develop proficiency in capturing behavior across multiple levels of granularity for two interconnected reasons. First, self-similarity implies that the patterns in language at the level of a paragraph are reflective of the patterns seen at the level of a whole text. Hence, recognizing short-term patterns may also aid in learning broader contexts. Second, because language displays detailed, intricate patterns at every level of granularity, it would not be enough to rely only on the immediate context of a sentence to predict the next token. Instead, the model would need to identify patterns at higher levels of granularity; i.e. understand the direction of the argument, and the broader context and intent. It must balance between short- and long-term contexts. Willinger et al. (1995) and Altmann et al. (2012) argue for self-similarity in language precisely because of this hierarchical nature.

Long-range dependence. However, self-similarity by itself is not sufficient for a predictive model to exhibit anything resembling "intelligent" behavior. In fact, some self-similar processes, despite their intricacy across all levels of granularity, remain entirely unpredictable. A quintessential example is the simple Brownian motion, which is a Wiener process with independent increments. Its discrete analog Bn is defined by Bn = Σ_{i=1}^n εi, where εi ∼ N(0, σ²). Despite possessing rich details at all granularities, a model trained to predict Bn obviously cannot acquire any intelligence since the process itself has independent increments.

Thus, for intelligent behavior to manifest, the process must have some degree of predictability or dependence as well. One classical metric for quantifying predictability in a stochastic process is the Hurst parameter (Hurst, 1951), developed by the hydrologist H. E. Hurst in 1951 while studying the Nile river flooding. It is generally considered to be a robust metric (Willinger et al., 1995), unlike for instance the wavelet estimator (Abry et al., 1995) and the periodogram method (Geweke & Porter-Hudak, 1983) that can be sensitive to errors (Pilgrim & Taylor, 2018). As discussed in Section 2.3, we estimate the Hurst parameter in language to be H = 0.70 ± 0.09. For context, H can only take values in [0, 1]. A higher value suggests more predictability or persistence in the data, while a lower Hurst parameter indicates more randomness (H = 0.5 for completely random systems).

While it is compelling that our estimate of H in language lies nearly midway between predictability (H = 1) and noise (H = 0.5), a Hurst parameter of about 0.75 turns out to occur commonly in nature, including in river discharges, Ethernet traffic, temperatures, precipitation, and tree rings (Crovella & Bestavros, 1995; Feller, 1951; Aref, 1998). For agents that learn from data, such as LLMs, this value is also reminiscent of processing-based theories of curiosity, which suggest that a sweet spot of complexity exists (not too simple, nor too unpredictable) that facilitates or accelerates learning (Kidd & Hayden, 2015).

Importantly, predictability and self-similarity together imply long-range dependence (LRD). This follows from the definition of self-similarity, where the patterns at small scales mirror those at larger scales so, for example, the correlations established at micro levels are also pertinent at macro levels. LRD is arguably necessary for intelligence to emerge in a predictive model because processes with only short-range dependence could be forecasted (somewhat trivially) with lookup tables that provide the likelihood of transitions over brief sequences. By contrast, this is not possible in LRD processes whose contexts extend indefinitely into the past. From a practical standpoint, we demonstrate in Section 3 that incorporating fractal parameters can markedly enhance the ability to predict downstream performance in LLMs, compared to perplexity-based metrics alone, such as bits-per-byte (BPB).
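To make the contrast concrete, the discrete Brownian motion described above can be simulated in a few lines. The following is a minimal NumPy sketch (illustrative only, not part of our experimental pipeline); it shows that the increments carry no information about one another even though the process has rich detail at every scale:

```python
import numpy as np

# Minimal sketch: B_n = sum_{i=1}^{n} eps_i with eps_i ~ N(0, sigma^2).
rng = np.random.default_rng(0)
eps = rng.normal(loc=0.0, scale=1.0, size=4096)   # independent increments
B = np.cumsum(eps)                                # the integral process B_n

# The lag-1 autocorrelation of the increments is ~0, so no predictor can do
# better than guessing the mean of the next increment, despite B's rich
# structure at all granularities.
print(np.corrcoef(eps[:-1], eps[1:])[0, 1])
```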

Specifically, we introduce a new metric averaging H with 1/BPB and show that using it to predict downstream performance can increase the adjusted R² from approximately 0.65 when using solely BPB, to over 0.86 with the new metric. We do not observe improvements when predicting rankings, however.

Statement of Contribution. In summary, we:

1. highlight how the fractal structure of language can offer a unique perspective on the intelligent behavior exhibited by LLMs, and provide a precise formalism to quantify properties, such as long-range dependence.

2. establish that language is self-similar and long-range dependent. We provide concrete estimates in language of the three parameters: the self-similarity (Hölder) exponent, the Hurst parameter, and the fractal dimension. We also estimate the related Joseph exponent.

3. carry out a comparative study across different model architectures and scales, and different domains, such as ArXiv, GitHub, and Wikipedia, among others.

4. connect fractal patterns with learning. Notably, we show that a "median" Hurst exponent (defined in Section 3) improves upon perplexity-based bits-per-byte (BPB) in predicting downstream performance.

2. Fractal Structure of Language

2.1. Preliminaries

Suppose we have a discrete-time, stationary stochastic process (xt)t∈N, with E[xt] = 0 and E[xt²] = 1. We will refer to (xt)t∈N as the increment process to distinguish it from the integral process (Xt)t∈N defined by Xt = Σ_{k=0}^t xk. While (xt)t∈N and (Xt)t∈N are merely different representations of the same data, it is useful to keep both representations in mind. For example, self-similarity is typically studied in the context of integral processes whereas long-range dependence (LRD) is defined on increment processes.

In the literature, it is not uncommon to mistakenly equate parameters that are generally different. For example, the Hurst parameter has had many different definitions in the past that were not equivalent, and Mandelbrot himself had cautioned against this (Mandelbrot, 2002). The reason is that different parameters can agree in the idealized fractional Brownian motion setting, leading some researchers to equate them in general (Watkins, 2019). We will keep the self-similarity exponent S and the Hurst parameter H separate in our discussion.

Experimental Setup. In order to establish self-similarity and LRD in language, we convert texts into sequences of bits using a large language model (LLM). Specifically, we use PaLM2-L (Unicorn) (Anil et al., 2023b) to calculate the probability of the next token wt conditioned on its entire prefix w[t−1] = (w0, w1, ..., wt−1). By the chain rule (Cover, 1999), the corresponding number of bits assigned to wt is zt = − log p(wt | w[t−1]). Unlike in prior works, which rely on simplifications such as substituting a word with its length (Ausloos, 2012) or focusing on the recurrence of a single word (Najafi & Darooneh, 2015; Altmann et al., 2012), we use the LLM to approximate the full joint distribution of language. We carry out these calculations for prefixes of up to 2048 tokens (≈ 8 pages of text). Since language is a stochastic process, the sequence of bits of each token conditioned on its past converges asymptotically to the average number of bits required to encode the entire sequence (Cover, 1999). Hence, a suitable normalization, such as bits-per-byte (BPB), results in a standardized description of text, consistent across tokenizers. BPB is widely used as a tokenizer-agnostic metric to compare language modeling performance, e.g. for The Pile (Gao et al., 2020).

Besides PaLM2, we also experiment and report on various model sizes of PaLM (Chowdhery et al., 2022) and decoder-only T5 (Raffel et al., 2019). Namely, we report results for the models: PaLM2 XXS (Gecko), XS (Otter), S (Bison), M, and L (Unicorn); PaLM 8B, 62B, 540B; and decoder-only T5.1.1 at Base (110M), Large (341M), XL (1.2B), and XXL (5B) sizes. For PaLM and PaLM2, we use the checkpoints pretrained in Chowdhery et al. (2022) and Anil et al. (2023b). All T5.1.1 decoder baselines, on the other hand, are trained with a causal language modeling objective for 262B tokens of C4 (Raffel et al., 2019). More details on how we train our T5.1.1 baselines can be found in Appendix A.

To rely on an LLM for such analysis, it must provide probability scores that are reasonably well-calibrated. Generally, LLMs are known to produce calibrated probability scores at the token level (Kadavath et al., 2022). In Figure 3, we reconfirm this by comparing the logits, − log p(word), predicted by one of the small language models we use in our study (PaLM-8B) with the actual log probabilities derived from the Google Web Trillion Word Corpus (Brants & Franz, 2006) based on word frequencies. We use histogram binning (by grouping similar logits together) and plot their averaged actual log probabilities, similar to how the expected calibration error (ECE) is calculated (Guo et al., 2017). Notably, we find a strong agreement for the most frequently occurring words, i.e., when the word probability satisfies p ≫ 10⁻⁹.

Once zt is computed for a document, we construct the increment process (xt)t∈N by normalizing zt to have a zero mean and unit variance. The integral process (Xt)t∈N is calculated based on (xt)t∈N, as described earlier and depicted in Figure 1 (top). Normalizing bits (to have zero mean and unit variance) models language as a random walk. It is a standard approach used extensively in the literature in various contexts, such as in DNA sequences (Peng et al., 1992; Roche et al., 2003; Montemurro & Pury, 2002; Kokol & Podgorelec, 2000; Schenkel et al., 1993).
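The construction just described amounts to a normalization followed by a cumulative sum. Below is a minimal sketch, assuming the per-token bits zt have already been obtained from some autoregressive language model (the array name token_bits is ours, not an API):

```python
import numpy as np

def increment_and_integral(token_bits):
    """Normalize per-token bits z_t to zero mean and unit variance (the
    increment process x_t) and accumulate them into X_t = sum_{k<=t} x_k."""
    z = np.asarray(token_bits, dtype=np.float64)
    x = (z - z.mean()) / z.std()   # increment process (x_t)
    X = np.cumsum(x)               # integral process (X_t): language as a random walk
    return x, X

# Illustrative usage with synthetic bits standing in for -log p(w_t | w_{<t}).
rng = np.random.default_rng(0)
x, X = increment_and_integral(rng.gamma(shape=2.0, scale=1.0, size=2048))
```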

Figure 2. Peak probability pϵ(τ) is plotted against the granularity level τ (see Section 2.2). We observe a power law pϵ(τ) ∼ τ^−S in all domains, indicating a self-similar structure, with a median self-similarity exponent of S = 0.59 ± 0.08.

Figure 3. Comparison of PaLM-8B's logits with actual log-probabilities. We observe a substantial agreement except for exceedingly uncommon words with a probability < 10⁻⁹. This is consistent with reported findings that LLMs produce calibrated probability scores for tokens; e.g. (Kadavath et al., 2022).

For analysis, we use The Pile validation split (Gao et al., 2020), consisting of 22 subdomains such as Wikipedia and GitHub. We restrict analysis to sufficiently long documents of length > 4K tokens and use the first 2K tokens only, to sidestep potential effects of the finite length of documents and the model context. To mitigate noise, only domains with > 1K documents are compared; we report results for them separately and their median. We use bootstrapping (Efron & Tibshirani, 1994) to estimate the error margin.

Notation. We write f(x) ∼ x^c if f(x) = x^c L(x) for some slowly-varying function L; i.e. L(tx)/L(x) → 1 as x → ∞ for all t > 0. Examples of slowly varying functions are constants L(x) = c and L(x) = log x. When f(x) ∼ x^c, we will abuse terminology slightly by referring to f(x) as a power law function.

2.2. Self-similarity exponent

An integral process is said to be self-similar if it exhibits statistical self-similarity. More precisely, (Xt)t∈N is self-similar if (Xτt)t∈N is distributionally equivalent to (τ^S Xt)t∈N for some exponent S. Thus, scaling of time is equivalent to an appropriate scaling of space. We will refer to τ as the granularity level and to the exponent S as the self-similarity exponent. It is worth noting that S is also called the Hölder exponent (Watkins, 2019). Many time series in nature exhibit self-similar structures, such as human blood pressure and heart rate (Goldberger et al., 2002).

One approach for calculating the self-similarity exponent S is as follows. First, fix ϵ ≪ 1 and denote the τ-increments by (Xt+τ − Xt)t∈N. These would correspond, for instance, to the number of bits used for clauses, sentences, paragraphs and longer texts as τ increases. In terms of the increment process (xt)t∈N, this corresponds to aggregating increments into "bursts". Let pϵ(τ) be the probability mass of the event {|Xt+τ − Xt| ≤ ϵ}t∈N. Then, S can be estimated by fitting a power law relation pϵ(τ) ∼ τ^−S (Watkins, 2019).

Figure 2 (top) plots the probability pϵ(τ) against τ when ϵ = 5 × 10⁻³ using PaLM2-L. We indeed observe a power law relation, i.e. linear in a log-log scale, with a median self-similarity exponent of S = 0.59 ± 0.08. Section 3 shows that the median S is robust to the choice of the LLM.

2.3. Hurst parameter

The Hurst parameter H ∈ [0, 1] quantifies the degree of predictability or dependence over time (Hurst, 1951). It is calculated using the so-called rescaled-range (R/S) analysis. Let (xt)t∈N be an increment process. For each n ∈ N, write yt = xt − (1/t) Σ_{k=0}^t xk and Yt = Σ_{k=0}^t yk. The range and scale are defined, respectively, as R(n) = max_{t≤n} Yt − min_{t≤n} Yt and S(n) = σ({xk}_{k≤n}), where σ is the standard deviation. Then, the Hurst parameter H is estimated by fitting a power law relation R(n)/S(n) ∼ n^H.
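Both estimators reduce to a least-squares fit on a log-log scale. The sketch below follows our reading of Sections 2.2 and 2.3; the granularity levels, prefix sizes, and the threshold ϵ are illustrative defaults rather than the exact experimental configuration:

```python
import numpy as np

def estimate_S(X, taus, eps=5e-3):
    """Self-similarity exponent: fit p_eps(tau) ~ tau^(-S), where p_eps(tau) is
    the fraction of tau-increments of the integral process X within eps of 0."""
    taus = np.asarray(taus)
    p = np.array([np.mean(np.abs(X[tau:] - X[:-tau]) <= eps) for tau in taus])
    keep = p > 0                                    # avoid log(0) at large tau
    slope, _ = np.polyfit(np.log(taus[keep]), np.log(p[keep]), 1)
    return -slope

def estimate_H(x, ns):
    """Hurst parameter via rescaled-range analysis: fit R(n)/S(n) ~ n^H on
    prefixes of the increment process x."""
    rs = []
    for n in ns:
        w = x[:n]
        Y = np.cumsum(w - w.mean())                 # partial sums of deviations
        rs.append((Y.max() - Y.min()) / w.std())    # range / scale
    slope, _ = np.polyfit(np.log(ns), np.log(rs), 1)
    return slope

# Illustrative usage on an (x, X) pair built as in Section 2.1:
# taus = np.unique(np.logspace(0, 3, 20).astype(int))
# ns = np.unique(np.logspace(1, np.log10(len(x)), 20).astype(int))
# S_hat, H_hat = estimate_S(X, taus), estimate_H(x, ns)
```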

Figure 4. Rescaled range R(n)/S(n) is plotted against the number of normalized bits n. We observe a power law R(n)/S(n) ∼ n^H in all domains. When aggregating all datasets, H = 0.70 ± .01, indicating long-range dependence (LRD).

As stated earlier, for completely random processes, such as a simple Brownian motion, it can be shown that H = 1/2. In addition, H > 1/2 implies dependence over time (Crovella & Bestavros, 1995; Willinger et al., 1995; Aref, 1998).

Writing ρn = E[xt+n xt] for the autocovariance function of the increment process (xt)t∈N, the Hurst parameter satisfies H = 1 − β/2 when ρn ∼ n^−β as n → ∞ (Gneiting & Schlather, 2004; Crovella & Bestavros, 1995). Since in self-similar processes H > 1/2 implies long-range dependence (LRD), LRD is equivalent to the condition that the autocovariances are not summable. In terms of the integral process, it can be shown that (Samorodnitsky, 2006):

    lim_{n→∞} Var(Xn)/n = 1 + 2 Σ_{i=1}^∞ ρi.    (1)

Hence, if H < 1/2, the autocovariances are summable and Var(Xn) grows, at most, linearly in n. On the other hand, if the process has LRD, Var(Xn) grows superlinearly in n. In particular, using the Euler-Maclaurin summation formula (Apostol, 1999; Alabdulmohsin, 2018), one obtains Var(Xn) ∼ n^{2H} if H > 1/2. Figure 4 plots the rescaled range R(n)/S(n) against n. We observe a power law relation with a median Hurst parameter of H = 0.70 ± 0.09.

2.4. Fractal dimension

Broadly speaking, the fractal dimension of an object describes its local complexity. For a geometric object Z, such as the Koch curve, let τ be a chosen scale (e.g. a short ruler for measuring lengths or a small square for areas). Let N(τ) be the minimum number of objects of scale τ that cover Z. Then, the fractal dimension of Z, also called its Hausdorff dimension, is D = − lim_{τ→0} log N(τ) / log τ (Pilgrim & Taylor, 2018). For example, a line has a fractal dimension of 1, in agreement with its topological dimension, because N(τ) = C/τ for some constant C > 0.

By convention, an object is referred to as "fractal" if D is different from its topological dimension. For example, the fractal dimension of the Koch curve is about 1.26 while its topological dimension is 1. Fractals explain some puzzling observations, such as why estimates of the length of the coast of Britain varied significantly from one study to another, because lengths in fractals are scale-sensitive. Mandelbrot estimated the fractal dimension of the coast of Britain to be 1.25 (Mandelbrot, 1967).

The definition above for the fractal dimension D applies to geometric shapes, but an analogous definition has been introduced for stochastic processes. Let (xt)t∈R be a stationary process with autocovariance ρn. Then, its fractal dimension D is determined according to the local behavior of ρn in the vicinity of n = 0: after normalizing (xt)t∈R to have a zero mean and a unit variance, model ρn using a power law ρn ∼ 1 − n^α as n → 0⁺, for α ∈ (0, 2]. Then, the fractal dimension D ∈ [1, 2] of (xt)t∈R is defined by D = 2 − α/2 (Gneiting & Schlather, 2004). A value D ≫ 1 indicates a significant fractal structure.

It can be shown that D = 2 − S, where S is the self-similarity exponent (Gneiting & Schlather, 2004). For language, this gives a median fractal dimension of D = 1.41 ± 0.08.

2.5. Joseph effect

Next, we examine another related parameter that is commonly studied in self-similar processes. The motivation behind it comes from the fact that in processes with LRD, one often observes burstiness as shown in Figure 1; i.e. clusters over time in which the process fully resides on one side of the mean, before switching to the other. This is quite unlike random noise, for instance, where measurements are evenly distributed on both sides of the mean. The effect is often referred to as the Joseph effect, named after the biblical story of the seven fat years and seven lean years (Willinger et al., 1995; Mandelbrot & Wallis, 1968; Watkins, 2019).

OpenWeb GitHub FreeLaw PileCC Wiki PubMed Math ArXiv


S 0.53 ± .05 0.60 ± .05 0.61 ± .05 0.56 ± .03 0.62 ± .02 0.60 ± .07 0.42 ± .03 0.70 ± .03
H 0.68 ± .01 0.79 ± .01 0.68 ± .00 0.70 ± .00 0.74 ± .01 0.65 ± .00 0.50 ± .01 0.72 ± .01
J 0.46 ± .01 0.49 ± .00 0.49 ± .00 0.50 ± .00 0.52 ± .00 0.44 ± .00 0.28 ± .00 0.49 ± .00

Table 1. A comparison of the fractal parameters across 8 different domains with > 1000 documents each in The Pile benchmark (see
Section 2.1 for selection criteria). DM-Mathematics is markedly different because each document consists of questions, with no LRD.

T5-Decoder: 110M 340M 1B 5B | PaLM: 8B 62B 540B | PaLM2: XXS XS S M L
S .58±.06 .60±.06 .60±.05 .58±.08 .60±.07 .62±.08 .64±.08 .59±.06 .57±.08 .56±.05 .59±.07 .60±.08
H .64±.08 .64±.08 .64±.09 .64±.08 .66±.07 .68±.07 .68±.07 .66±.07 .66±.07 .67±.08 .68±.09 .69±.09
J .44±.06 .44±.06 .44±.06 .44±.06 .47±.06 .47±.06 .48±.06 .47±.06 .47±.06 .48±.07 .48±.07 .49±.08

Table 2. A comparison of the estimated median fractal parameters by various LLMs over the entire Pile validation split. Estimates are
generally robust to the choice of the LLM, but the tiny variations in median H reflect improvements in the model quality. See Section 3.

A common way to quantify the Joseph effect for integral processes (Xt)t∈N is as follows (Watkins, 2019). First, let στ be the standard deviation of the τ-increments Xt+τ − Xt. Then, fit a power law relation στ ∼ τ^J. The exponent J here is called the Joseph exponent. In an idealized fractional Brownian motion, both J and the self-similarity exponent S coincide. Figure 5 provides the detailed empirical results. Overall, we obtain an estimate of J = 0.49 ± 0.08, which is intriguing because J = 0.5 corresponds to self-similar processes with independent increments.

3. Analysis

Comparative Analysis. Table 1 compares the estimated fractal parameters across different domains, such as ArXiv, GitHub and Wikipedia. In general, most domains share similar self-similarity and Hurst exponents, with a few exceptions. The first notable exception is DM-Mathematics, which has a Hurst parameter of about 0.5. To recall, a value of H = 0.5 indicates that the data does not exhibit long-range dependence (LRD). Upon closer inspection, however, a value of H = 0.5 is not surprising for DM-Mathematics because its documents consist of independent mathematical questions, as shown in Figure 7. The second notable observation is the relatively larger value of H = 0.79 in GitHub, indicating more structure in code. This is in agreement with earlier findings by Kokol & Podgorelec (2000), who estimated LRD in computer languages to be greater than in natural language.

In Table 2, we compare the three fractal parameters S, H and J using different families of LLM and different model sizes. Overall, we observe that the estimated parameters are generally robust to the choice of the architecture.

Downstream Performance. By definition, fractal parameters are calculated on the sequence of log-perplexity scores after normalizing them to zero mean and unit variance. Hence, they may offer an assessment of downstream performance that improves upon using a perplexity-based metric like bits-per-byte (BPB) alone.

To test this hypothesis, we evaluate the 12 models in Table 2 on challenging downstream zero- and few-shot benchmarks focusing on language understanding and reasoning. We include results for 0-shot (0S) and 3-shot (3S) evaluation for BIG-Bench Hard tasks (Srivastava et al., 2022; Suzgun et al., 2022), reporting both direct and chain-of-thought (CoT) prompting results following Chung et al. (2022). In addition, we report 0-shot and 5-shot (5S) MMLU (Hendrycks et al., 2020), and 8-shot (8S) GSM8K (Cobbe et al., 2021) with CoT. Raw accuracy is reported for all tasks. BBH and MMLU scores are averaged across all 21 tasks and 57 subjects, respectively. All prompt templates for our evaluation are taken from Chung et al. (2022); Longpre et al. (2023), which we refer the reader to for more details. We prompt all models using a 2048 context length. See Table 9 of Appendix C for the full results.

The first (surprising) observation is that the median Hurst parameter is itself strongly correlated with the BPB scores, with an absolute Pearson correlation coefficient of 0.83, even though the Hurst exponent is calculated after normalizing all token losses to zero mean and unit variance! Informally, this implies that second-order statistics on the sequence of token losses of a particular model can predict its mean! The self-similarity exponent, by contrast, has an absolute Pearson correlation of 0.23 with BPB.

Figure 6 displays downstream performance against both the median Hurst exponent and the median BPB score, where median values are calculated on the 8 domains in The Pile benchmark listed in Table 1. In general, both the BPB score and the median Hurst are good predictors of downstream performance. However, we observe that improvements in BPB alone without impacting the median Hurst exponent do not directly translate into improvements downstream.
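The combined predictor discussed above (and quantified in Table 3 below) can be reproduced with a short regression. This is a hedged sketch with synthetic placeholder values: bpb, hurst, and downstream stand in for the per-model numbers in Tables 2 and 9, and adjusted R² follows its standard definition:

```python
import numpy as np

def adjusted_r2(y, X):
    """Adjusted R^2 of an ordinary least-squares fit of y on the columns of X."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])           # intercept + predictors
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Synthetic stand-ins for the 12 models (one row per model).
rng = np.random.default_rng(0)
bpb = rng.uniform(0.6, 1.1, size=12)
hurst = rng.uniform(0.62, 0.72, size=12)
downstream = 80.0 - 60.0 * bpb + 50.0 * hurst + rng.normal(0.0, 2.0, size=12)

hb = 1.0 / bpb + hurst                             # combined metric HB = 1/BPB + H
print(adjusted_r2(downstream, bpb[:, None]))       # BPB alone
print(adjusted_r2(downstream, hb[:, None]))        # HB = 1/BPB + H
```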

Figure 5. The standard deviation σ of the τ-increments Xt+τ − Xt is plotted against the scale τ. We, again, observe another power law relation σ ∼ τ^J, with a Joseph exponent J = 0.49 ± 0.08.

                        Magnitude                 Ranking
                        BPB     H       HB        BPB     HB
0S BBH Direct           0.785   0.841   0.883     0.958   0.958
0S MMLU                 0.653   0.831   0.825     0.769   0.769
0S BBH+MMLU             0.685   0.849   0.852     0.930   0.930
3S BBH Direct           0.767   0.895   0.926     1.000   1.000
3S BBH CoT              0.881   0.892   0.979     1.000   1.000
5S MMLU                 0.660   0.853   0.832     0.783   0.783
8S GSM8K CoT            0.654   0.867   0.851     0.993   0.993
FS BBH+MMLU+GSM8K       0.717   0.890   0.891     1.000   1.000

Table 3. Adjusted R², which measures the proportion of variation in downstream performance (row) predictable by a linear regressor with the given input (column). The combined metric HB = 1/BPB + H predicts downstream performance better in all downstream metrics, compared to BPB alone. S and J do not yield such improvements (see Appendix C). For ranking, we report Spearman correlations, which suggest that BPB is sufficient.

                        2K      4K      8K
0S BBH Direct           1.81    1.68    1.76
0S MMLU                 25.73   26.04   25.81
0S BBH+MMLU             13.39   13.49   13.42
3S BBH Direct           21.35   24.76   23.14
3S BBH CoT              16.87   12.21   7.14
5S MMLU                 26.57   26.69   27.07
8S GSM8K CoT            1.06    1.21    1.74
FS BBH+MMLU+GSM8K       15.58   15.46   14.65

Table 4. Downstream performance comparison for three decoder-only T5.1.1 models pretrained on 100B tokens with either 2K, 4K, or 8K context lengths.

This is verified quantitatively in Table 3, which reports the adjusted R² values, i.e. the proportion of variance in each downstream metric that can be predicted using BPB, H, or by combining them together into HB = 1/BPB + H, with BPB replaced with its reciprocal so that higher values are better. We observe that HB indeed yields a stronger predictor of downstream performance. For ranking, however, BPB alone is sufficient. See Appendix C for a similar analysis using the exponents S and J.

Context Length at Training Time. Finally, self-similarity and long-range dependence point to an intriguing possibility: the importance of training the model with extensive contexts in order to capture the fractal nature of language, which may elevate the model's capabilities regardless of the context length needed during inference. To test this hypothesis, we pretrain three decoder-only T5.1.1 models with 1B parameters on SlimPajama-627B (Soboleva et al., 2023) for up to 100B tokens using three context lengths: 2K, 4K and 8K, all observing the same number of tokens per batch. We use SlimPajama-627B instead of C4 because most documents in C4 are short (≈ 94% of them are < 2K tokens in length). Refer to Appendix A for details. These models are then evaluated on the same downstream benchmarks listed in Figure 6 and Table 3. As shown in Table 4, however, we do not observe any improvements in performance with context length in this particular setup.

4. Related Works

The statistical attributes of human language have long piqued scholarly curiosity. One example is Zipf's law, which Shannon leveraged to estimate the entropy of English to be around 1 bit per letter (Shannon, 1951), but his calculation did not consider second-order statistics. More recently, Eftekhari (2006) proposed a refinement to Zipf's law, suggesting its application to letters rather than words. Another related result is Heaps' law, which states that the number of unique words in a document is a power law function of the document's length (Heaps, 1978). However, both Zipf's and Heaps' laws are invariant to the semantic ordering of text, so they do not capture important aspects, such as long-range dependence (LRD) (Najafi & Darooneh, 2015).

Figure 6. Downstream metric, indicated by bubble size, is plotted vs. the median Hurst and the median BPB for all 12 language models.

Document I: What is the square root of 211269 to the nearest integer? 460. What is the square root of 645374 to the nearest integer? 803...

Document II: Suppose 5*l = r - 35, -2*r + 5*l - 15 = -70. Is r a multiple of 4? True. Suppose 2*l + 11 - 1 = 0. Does 15 divide (-2)/l - 118/(-5)? False...

Figure 7. Two examples of documents from the DM-Mathematics subset of The Pile benchmark (Gao et al., 2020). Each document comprises multiple independent questions. The lack of LRD in this data is reflected in its Hurst parameter of H = 0.50 ± 0.01.

In terms of self-similarity in language, the Menzerath-Altmann law stipulates a self-similar behavior in the following sense: when the size of a language construct increases, the size of its constituents decreases, and this happens at all scales (Najafi & Darooneh, 2015; Andres, 2009). In Ausloos (2012), the authors model texts as a time series by replacing a word with its length. After that, they study the fractal behavior of language. However, replacing a word with its length is invalid because it is not translation-independent (i.e. one could map every word to an arbitrary token, including tokens of equal length). In our work, we model language as a time series of bits calculated from conditional entropies, reflecting the structure of the language itself.

In Najafi & Darooneh (2015), the authors define a fractal dimension for each word. Informally, they examine the recurrence of a single, predetermined word in texts as an ON/OFF time series, similar to the approach used in Altmann et al. (2012). However, this is only applicable to individual words and cannot model higher-level clauses. For instance, it does not distinguish between the word "time" in the phrase "once upon a time" and the word "time" in "space and time." Kokol & Podgorelec (2000) estimate LRD in natural language, and suggest that its LRD is close to that of pure noise! They conjecture this was due to the use of ASCII encoding. In computer languages, they observe LRD and suggest this is because computer languages are formal.

Besides the above concerns in prior studies that examined the self-similar structure in language, another concern is that they sometimes give extremely large values of the fractal dimension, sometimes even exceeding 10 (Andres, 2009). Such values are difficult to interpret because classical definitions of the fractal dimension restrict its value to the range [1, 2] for time series. We do not observe such issues in our analysis. In our case, D = 1.41 ± 0.08.

5. Concluding Remarks

In this work, we highlight intriguing insights into the underlying fractal structure of language and how it may be interconnected with the intelligent behavior of LLMs. Our formalism quantifies properties of language that may have been suspected, but not previously formally shown. In particular, the need in LLMs to balance between short- and long-term contexts is reflected in the self-similar structure of language, while long-range dependence is quantifiable using the Hurst parameter. For instance, the absence of LRD in DM-Mathematics is reflected in its Hurst parameter of H ≈ 0.5. Interestingly, the estimated median Hurst value of H = 0.70 ± 0.09 in language reflects an intriguing balance between predictability and noise that is similar to many other phenomena, and combining H with BPB yields a stronger predictor of downstream performance. We carry out an extensive comparative analysis across different domains and model architectures, revealing that fractal parameters are generally robust. We hope that future research can further probe into these fractal properties, unearthing deeper understandings of the relation between intelligence and language.

6. Acknowledgement

The authors would like to thank Justin Gilmer and Olivier Bousquet for their feedback on earlier drafts of this manuscript, and both Google DeepMind and Google Research teams at large for the insightful discussions and providing a supportive research environment.

7. Potential Broader Impact

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

Abry, P., Gonçalvés, P., and Flandrin, P. Wavelets, spectrum analysis and 1/f processes. Wavelets and statistics, pp. 15–29, 1995.

Alabdulmohsin, I. M. Summability calculus: A comprehensive theory of fractional finite sums. Springer, 2018.

Altmann, E. G., Cristadoro, G., and Esposti, M. D. On the origin of long-range correlations in texts. Proceedings of the National Academy of Sciences, 109(29):11582–11587, 2012.

Andres, J. On de Saussure's principle of linearity and visualization of language structures. Glottotheory, 2(2):1–14, 2009.

Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., Millican, K., Silver, D., Petrov, S., Johnson, M., Antonoglou, I., Schrittwieser, J., Glaese, A., Chen, J., Pitler, E., et al. Gemini: A family of highly capable multimodal models. arXiv:2312.11805v1 [cs.CL], 2023a.

Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., Chu, E., Clark, J. H., Shafey, L. E., Huang, Y., Meier-Hellstern, K., Mishra, G., Moreira, E., Omernick, M., Robinson, K., Ruder, S., Tay, Y., Xiao, K., Xu, Y., Zhang, Y., Abrego, G. H., Ahn, J., Austin, J., Barham, P., Botha, J., Bradbury, J., Brahma, S., Brooks, K., Catasta, M., Cheng, Y., Cherry, C., Choquette-Choo, C. A., Chowdhery, A., Crepy, C., Dave, S., Dehghani, M., Dev, S., Devlin, J., Díaz, M., Du, N., Dyer, E., Feinberg, V., Feng, F., Fienber, V., Freitag, M., Garcia, X., Gehrmann, S., Gonzalez, L., Gur-Ari, G., Hand, S., Hashemi, H., Hou, L., Howland, J., Hu, A., Hui, J., Hurwitz, J., Isard, M., Ittycheriah, A., Jagielski, M., Jia, W., Kenealy, K., Krikun, M., Kudugunta, S., Lan, C., Lee, K., Lee, B., Li, E., Li, M., Li, W., Li, Y., Li, J., Lim, H., Lin, H., Liu, Z., Liu, F., Maggioni, M., Mahendru, A., Maynez, J., Misra, V., Moussalem, M., Nado, Z., Nham, J., Ni, E., Nystrom, A., Parrish, A., Pellat, M., Polacek, M., Polozov, A., Pope, R., Qiao, S., Reif, E., Richter, B., Riley, P., Ros, A. C., Roy, A., Saeta, B., Samuel, R., Shelby, R., Slone, A., Smilkov, D., So, D. R., Sohn, D., Tokumine, S., Valter, D., Vasudevan, V., Vodrahalli, K., Wang, X., Wang, P., Wang, Z., Wang, T., Wieting, J., Wu, Y., Xu, K., Xu, Y., Xue, L., Yin, P., Yu, J., Zhang, Q., Zheng, S., Zheng, C., Zhou, W., Zhou, D., Petrov, S., and Wu, Y. PaLM 2 technical report. arXiv:2305.10403v3 [cs.CL], 2023b.

Apostol, T. M. An elementary view of Euler's summation formula. The American Mathematical Monthly, 106(5):409–418, 1999.

Aref, S. Hurst phenomenon and fractal dimensions in long-term yield data. In Conference on Applied Statistics in Agriculture, 1998.

Ausloos, M. Generalized Hurst exponent and multifractal function of original and translated texts mapped into frequency and length time series. Physical Review E, 86(3):031108, 2012.

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/google/jax.

Brants, T. and Franz, A. Web 1T 5-gram Version 1, 2006. URL https://catalog.ldc.upenn.edu/LDC2006T13. Web Download. Philadelphia: Linguistic Data Consortium.

Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., Nori, H., Palangi, H., Ribeiro, M. T., and Zhang, Y. Sparks of artificial general intelligence: Early experiments with GPT-4, 2023.

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., et al. Scaling instruction-finetuned language models. arXiv:2210.11416v5 [cs.LG], 2022.

Cobbe, K., Kosaraju, V., Bavarian, M., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv:2110.14168v2 [cs.LG], 2021.

Cover, T. M. Elements of information theory. John Wiley & Sons, 1999.

Crovella, M. E. and Bestavros, A. Explaining world wide web traffic self-similarity. Technical report, Boston University Computer Science Department, 1995.

Efron, B. and Tibshirani, R. J. An introduction to the bootstrap. CRC press, 1994.

Eftekhari, A. Fractal geometry of texts: An initial application to the works of Shakespeare. Journal of Quantitative Linguistics, 13(2-3):177–193, 2006. doi: 10.1080/09296170600850106.

Embrechts, P. and Maejima, M. An introduction to the theory of self-similar stochastic processes. International journal of modern physics B, 14(12n13):1399–1420, 2000.

Feller, W. The Asymptotic Distribution of the Range of Sums of Independent Random Variables. The Annals of Mathematical Statistics, 22(3):427–432, 1951. doi: 10.1214/aoms/1177729589. URL https://doi.org/10.1214/aoms/1177729589.

Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling. arXiv:2101.00027v1 [cs.CL], 2020.

Geweke, J. and Porter-Hudak, S. The estimation and application of long memory time series models. Journal of time series analysis, 4(4):221–238, 1983.

Gneiting, T. and Schlather, M. Stochastic models that separate fractal dimension and the Hurst effect. SIAM Review, 46(2):269–282, 2004. doi: 10.1137/s0036144501394387.

Goldberger, A. L., Amaral, L. A., Hausdorff, J. M., Ivanov, P. C., Peng, C.-K., and Stanley, H. E. Fractal dynamics in physiology: alterations with disease and aging. Proceedings of the national academy of sciences, 99(suppl 1):2466–2472, 2002.

Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In ICML. PMLR, 2017.

Heaps, H. S. Information retrieval, computational and theoretical aspects. Academic Press, 1978.

Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.

Hurst, H. E. Long-term storage capacity of reservoirs. Transactions of the American society of civil engineers, 116(1):770–799, 1951.

Jouppi, N. P., Yoon, D. H., Kurian, G., Li, S., Patil, N., Laudon, J., Young, C., and Patterson, D. A domain-specific supercomputer for training deep neural networks. Communications of the ACM, 63(7):67–78, 2020.

Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.

Kidd, C. and Hayden, B. Y. The psychology and neuroscience of curiosity. Neuron, 88(3):449–460, 2015.

Kokol, P. and Podgorelec, V. Complexity and human writings. Complexity, 7:1–6, 2000.

Kolmogorov, A. N. Wienersche spiralen und einige andere interessante kurven in hilbertscen raum, cr (doklady). Acad. Sci. URSS (NS), 26:115–118, 1940.

Leland, W. E. and Wilson, D. V. High time-resolution measurement and analysis of LAN traffic: Implications for LAN interconnection. In IEEE INFCOM, 1991.

Leland, W. E., Taqqu, M. S., Willinger, W., and Wilson, D. V. On the self-similar nature of Ethernet traffic. IEEE/ACM Transactions on networking, 2(1):1–15, 1994.

Longpre, S., Hou, L., Vu, T., Webson, A., Chung, H. W., Tay, Y., Zhou, D., Le, Q. V., Zoph, B., Wei, J., and Roberts, A. The flan collection: designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023.

Mandelbrot, B. How long is the coast of Britain? Statistical self-similarity and fractional dimension. Science, 156(3775):636–638, 1967.

Mandelbrot, B. Gaussian self-affinity and fractals: globality, the earth, 1/f noise, and R/S. Springer Science and Business Media, 2002.

Mandelbrot, B. B. The fractal geometry of nature. WH Freeman, New York, 1982.

Mandelbrot, B. B. and Wallis, J. R. Noah, Joseph, and operational hydrology. Water resources research, 4(5):909–918, 1968.

Montemurro, M. A. and Pury, P. A. Long-range fractal correlations in literary corpora. Fractals, 10(04):451–461, 2002.

Najafi, E. and Darooneh, A. H. The fractal patterns of words in a text: a method for automatic keyword extraction. PloS one, 10(6):e0130617, 2015.

OpenAI. GPT-4 technical report. arXiv:2303.08774v4 [cs.CL], 2023.

Paxson, V. and Floyd, S. Wide area traffic: the failure of Poisson modeling. IEEE/ACM Transactions on networking, 3(3):226–244, 1995.

Peng, C.-K., Buldyrev, S. V., Goldberger, A. L., Havlin, S., Sciortino, F., Simons, M., and Stanley, H. E. Long-range correlations in nucleotide sequences. Nature, 356(6365):168–170, 1992.

Pilgrim, I. and Taylor, R. P. Fractal analysis of time-series data sets: Methods and challenges. In Ouadfeul, S.-A. (ed.), Fractal Analysis, chapter 2. IntechOpen, Rijeka, 2018. doi: 10.5772/intechopen.81958. URL https://doi.org/10.5772/intechopen.81958.

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683v4 [cs.LG], 2019.

Roberts, A., Chung, H. W., Levskaya, A., Mishra, G., Bradbury, J., Andor, D., Narang, S., Lester, B., Gaffney, C., Mohiuddin, A., Hawthorne, C., Lewkowycz, A., Salcianu, A., van Zee, M., Austin, J., Goodman, S., Soares, L. B., Hu, H., Tsvyashchenko, S., Chowdhery, A., Bastings, J., Bulian, J., Garcia, X., Ni, J., Chen, A., Kenealy, K., Clark, J. H., Lee, S., Garrette, D., Lee-Thorp, J., Raffel, C., Shazeer, N., Ritter, M., Bosma, M., Passos, A., Maitin-Shepard, J., Fiedel, N., Omernick, M., Saeta, B., Sepassi, R., Spiridonov, A., Newlan, J., and Gesmundo, A. Scaling up models and data with t5x and seqio, 2022. URL https://arxiv.org/abs/2203.17189.

Roche, S., Bicout, D., Maciá, E., and Kats, E. Long range correlations in DNA: scaling properties and charge transfer efficiency. Physical review letters, 91(22):228101, 2003.

Samorodnitsky, G. Long memory and self-similar processes. In Annales de la Faculté des sciences de Toulouse: Mathématiques, volume 15, pp. 107–123, 2006.

Schenkel, A., Zhang, J., and Zhang, Y.-C. Long range correlation in human writings. Fractals, 1(01):47–57, 1993.

Shannon, C. E. Prediction and entropy of printed English. Bell system technical journal, 30(1):50–64, 1951.

Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pp. 4596–4604. PMLR, 2018.

Soboleva, D., Al-Khateeb, F., Myers, R., Steeves, J. R., Hestness, J., and Dey, N. SlimPajama: A 627B token cleaned and deduplicated version of RedPajama, June 2023. URL https://huggingface.co/datasets/cerebras/SlimPajama-627B.

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

Suzgun, M., Scales, N., Scharli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., and Wei, J. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv:2210.09261v1 [cs.CL], 2022.

Watkins, N. Mandelbrot's stochastic time series models. Earth and Space Science, 6(11):2044–2056, 2019.

Willinger, W., Taqqu, M. S., Leland, W. E., and Wilson, D. V. Self-similarity in high-speed packet traffic: analysis and modeling of Ethernet traffic measurements. Statistical science, pp. 67–85, 1995.

Willinger, W., Taqqu, M. S., Sherman, R., and Wilson, D. V. Self-similarity through high-variability: statistical analysis of Ethernet LAN traffic at the source level. IEEE/ACM Transactions on networking, 5(1):71–86, 1997.

A. Experiment Details
All of our experiments are conducted in JAX/Flax (Bradbury et al., 2018) using the open source T5X framework (Roberts
et al., 2022).
T5 baselines in Tables 2 and 3 are pretrained from scratch using the open source T5.1.1 decoder-only architecture from the
T5X library.1 We pretrain using a causal language modeling objective over the C4 corpus with the default T5 vocabulary as
per Raffel et al. (2019). Training is done for 500k steps with a sequence length of 1024 and batch size of 512, resulting in a
total of 262B tokens seen during pretraining. We optimize our model with the Adafactor (Shazeer & Stern, 2018) optimizer
with an inverse square root learning rate schedule, 1k warmup steps, and an initial learning rate of 1e-2. Models are trained
using 256 TPUv5e chips (Jouppi et al., 2020).
T5 context length ablation experiments in Table 4 are trained with the same pretraining objective but over the SlimPajama-
627B corpus (Soboleva et al., 2023) and using a modified version of the T5 vocabulary that preserves whitespace and
introduces byte-fallback for out-of-vocabulary tokens. This is similar to Chowdhery et al. (2022), but preserves the original
T5 vocabulary. Models with sequence lengths 2048, 4096, and 8192 are trained with batch sizes of 512, 256, and 128, respectively,
to preserve the number of tokens seen per batch and overall training steps. We train all models for 100k steps, using the
same learning rate schedule described above. Hence, all models observe 100B tokens.
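As a quick sanity check of the batch-size schedule above (a small sketch using only the numbers stated in this appendix):

```python
# Each configuration sees the same number of tokens per batch, so after the
# same 100k steps every model has observed the same ~100B-token budget.
configs = [(2048, 512), (4096, 256), (8192, 128)]  # (sequence length, batch size)
steps = 100_000
for seq_len, batch_size in configs:
    tokens_per_batch = seq_len * batch_size        # 1,048,576 tokens in each case
    total_tokens = tokens_per_batch * steps        # ~1.05e11, i.e. ~100B tokens
    print(seq_len, tokens_per_batch, total_tokens)
```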

1
https://github.com/google-research/t5x/tree/main/t5x/examples/decoder_only/models


B. Full Results
In this section, we provide the full list of parameters calculated for each combination of LLM and domain. We use
bootstrapping (Efron & Tibshirani, 1994) to estimate the error margin.

Model OpenWebText2 Github FreeLaw Pile-CC Wikipedia PubMed Mathematics ArXiv


T5-Decoder-110M 2.89 1.82 2.45 2.88 2.80 2.36 2.28 2.70
T5-Decoder-340M 2.60 1.56 2.14 2.62 2.52 2.08 2.10 2.42
T5-Decoder-1B 2.38 1.37 1.91 2.41 2.29 1.88 2.00 2.19
T5-Decoder-5B 2.19 1.22 1.73 2.25 2.11 1.73 1.91 2.01
PaLM1-8B 2.26 0.79 1.66 2.36 2.08 1.89 1.40 2.08
PaLM1-62B 2.02 0.62 1.44 2.14 1.80 1.68 1.30 1.83
PaLM1-540B 1.88 0.54 1.33 2.01 1.58 1.57 1.25 1.68
PaLM2-XXS 2.37 0.87 1.77 2.46 2.17 1.96 1.38 1.96
PaLM2-XS 2.12 0.73 1.53 2.22 1.92 1.72 1.27 1.72
PaLM2-S 1.95 0.60 1.37 2.06 1.71 1.57 1.19 1.55
PaLM2-M 1.88 0.56 1.31 1.99 1.59 1.51 1.12 1.48
PaLM2-L 1.75 0.46 1.23 1.88 1.22 1.43 1.08 1.36

Table 5. Log-perplexity (NLL) scores evaluated on the first 2048 tokens, after trimming the first 100 tokens, of documents belonging to
each of the shown domains. Only documents with a minimum length of 4K tokens are used.

Model OpenWebText2 Github FreeLaw Pile-CC Wikipedia PubMed Mathematics ArXiv


T5-Decoder-110M 0.58 ± 0.04 0.67 ± 0.03 0.51 ± 0.02 0.54 ± 0.07 0.59 ± 0.04 0.59 ± 0.03 0.51 ± 0.04 0.58 ± 0.05
T5-Decoder-340M 0.52 ± 0.03 0.59 ± 0.05 0.63 ± 0.04 0.58 ± 0.04 0.61 ± 0.03 0.61 ± 0.03 0.48 ± 0.04 0.61 ± 0.05
T5-Decoder-1B 0.54 ± 0.01 0.66 ± 0.11 0.61 ± 0.06 0.57 ± 0.06 0.59 ± 0.05 0.60 ± 0.02 0.50 ± 0.03 0.63 ± 0.02
T5-Decoder-5B 0.51 ± 0.04 0.70 ± 0.04 0.60 ± 0.04 0.58 ± 0.02 0.58 ± 0.03 0.57 ± 0.02 0.45 ± 0.02 0.67 ± 0.05
PaLM1-8B 0.56 ± 0.03 0.67 ± 0.05 0.63 ± 0.05 0.58 ± 0.01 0.55 ± 0.04 0.62 ± 0.03 0.50 ± 0.03 0.68 ± 0.07
PaLM1-62B 0.49 ± 0.03 0.65 ± 0.09 0.63 ± 0.09 0.57 ± 0.03 0.63 ± 0.05 0.61 ± 0.04 0.48 ± 0.05 0.68 ± 0.03
PaLM1-540B 0.51 ± 0.04 0.68 ± 0.09 0.64 ± 0.05 0.58 ± 0.04 0.67 ± 0.03 0.64 ± 0.08 0.48 ± 0.03 0.65 ± 0.04
PaLM2-XXS 0.53 ± 0.02 0.61 ± 0.05 0.58 ± 0.04 0.60 ± 0.04 0.57 ± 0.05 0.61 ± 0.03 0.52 ± 0.02 0.70 ± 0.04
PaLM2-XS 0.54 ± 0.04 0.57 ± 0.06 0.58 ± 0.03 0.56 ± 0.04 0.60 ± 0.04 0.57 ± 0.06 0.45 ± 0.02 0.73 ± 0.06
PaLM2-S 0.55 ± 0.02 0.55 ± 0.15 0.59 ± 0.02 0.54 ± 0.08 0.65 ± 0.04 0.58 ± 0.05 0.49 ± 0.04 0.61 ± 0.03
PaLM2-M 0.58 ± 0.02 0.62 ± 0.06 0.59 ± 0.04 0.60 ± 0.05 0.70 ± 0.03 0.56 ± 0.04 0.46 ± 0.04 0.62 ± 0.05
PaLM2-L 0.53 ± 0.05 0.60 ± 0.05 0.61 ± 0.05 0.56 ± 0.03 0.62 ± 0.02 0.60 ± 0.07 0.42 ± 0.03 0.70 ± 0.03

Table 6. Self-similarity exponent S evaluated on the first 2048 tokens, after trimming the first 100 tokens, of documents belonging to each
of the shown domains. Only documents with a minimum length of 4K tokens are used.


Model OpenWebText2 Github FreeLaw Pile-CC Wikipedia PubMed Mathematics ArXiv


T5-Decoder-110M 0.63 ± 0.00 0.82 ± 0.01 0.62 ± 0.01 0.67 ± 0.01 0.62 ± 0.01 0.65 ± 0.00 0.54 ± 0.01 0.68 ± 0.01
T5-Decoder-340M 0.63 ± 0.01 0.82 ± 0.01 0.62 ± 0.00 0.67 ± 0.00 0.62 ± 0.01 0.64 ± 0.01 0.54 ± 0.00 0.67 ± 0.01
T5-Decoder-1B 0.63 ± 0.01 0.83 ± 0.01 0.63 ± 0.01 0.67 ± 0.00 0.62 ± 0.01 0.64 ± 0.00 0.54 ± 0.00 0.67 ± 0.00
T5-Decoder-5B 0.63 ± 0.01 0.82 ± 0.00 0.62 ± 0.01 0.67 ± 0.01 0.62 ± 0.01 0.64 ± 0.01 0.54 ± 0.00 0.67 ± 0.00
PaLM1-8B 0.65 ± 0.01 0.81 ± 0.01 0.66 ± 0.00 0.68 ± 0.01 0.66 ± 0.00 0.65 ± 0.01 0.57 ± 0.00 0.69 ± 0.01
PaLM1-62B 0.66 ± 0.01 0.80 ± 0.00 0.67 ± 0.01 0.69 ± 0.01 0.68 ± 0.00 0.65 ± 0.00 0.57 ± 0.00 0.70 ± 0.00
PaLM1-540B 0.67 ± 0.00 0.79 ± 0.01 0.68 ± 0.00 0.69 ± 0.01 0.71 ± 0.01 0.65 ± 0.01 0.56 ± 0.00 0.70 ± 0.01
PaLM2-XXS 0.65 ± 0.01 0.81 ± 0.01 0.65 ± 0.01 0.68 ± 0.01 0.66 ± 0.01 0.65 ± 0.01 0.58 ± 0.00 0.71 ± 0.01
PaLM2-XS 0.65 ± 0.01 0.81 ± 0.01 0.66 ± 0.01 0.68 ± 0.01 0.67 ± 0.00 0.65 ± 0.00 0.56 ± 0.01 0.71 ± 0.01
PaLM2-S 0.67 ± 0.01 0.80 ± 0.01 0.66 ± 0.01 0.69 ± 0.00 0.68 ± 0.01 0.65 ± 0.01 0.54 ± 0.00 0.71 ± 0.00
PaLM2-M 0.67 ± 0.01 0.80 ± 0.01 0.67 ± 0.01 0.70 ± 0.01 0.70 ± 0.01 0.65 ± 0.01 0.52 ± 0.01 0.72 ± 0.01
PaLM2-L 0.68 ± 0.01 0.79 ± 0.01 0.68 ± 0.00 0.70 ± 0.00 0.74 ± 0.01 0.65 ± 0.00 0.50 ± 0.01 0.72 ± 0.01

Table 7. Hurst exponent H evaluated on the first 2048 tokens, after trimming the first 100 tokens, of documents belonging to each of the
shown domains. Only documents with a minimum length of 4K tokens are used.

Model OpenWebText2 Github FreeLaw Pile-CC Wikipedia PubMed Mathematics ArXiv


T5-Decoder-110M 0.44 ± 0.01 0.53 ± 0.00 0.42 ± 0.00 0.49 ± 0.01 0.45 ± 0.00 0.43 ± 0.00 0.33 ± 0.00 0.45 ± 0.00
T5-Decoder-340M 0.44 ± 0.02 0.53 ± 0.00 0.43 ± 0.00 0.49 ± 0.00 0.45 ± 0.01 0.43 ± 0.00 0.33 ± 0.00 0.45 ± 0.00
T5-Decoder-1B 0.43 ± 0.01 0.53 ± 0.00 0.43 ± 0.01 0.49 ± 0.01 0.45 ± 0.01 0.42 ± 0.00 0.33 ± 0.00 0.45 ± 0.01
T5-Decoder-5B 0.43 ± 0.01 0.53 ± 0.00 0.44 ± 0.00 0.49 ± 0.01 0.45 ± 0.00 0.42 ± 0.00 0.34 ± 0.00 0.45 ± 0.00
PaLM1-8B 0.45 ± 0.00 0.51 ± 0.00 0.46 ± 0.00 0.49 ± 0.01 0.48 ± 0.01 0.44 ± 0.01 0.34 ± 0.00 0.48 ± 0.01
PaLM1-62B 0.45 ± 0.00 0.50 ± 0.01 0.47 ± 0.00 0.49 ± 0.01 0.49 ± 0.00 0.44 ± 0.00 0.33 ± 0.00 0.48 ± 0.01
PaLM1-540B 0.46 ± 0.01 0.49 ± 0.01 0.47 ± 0.00 0.50 ± 0.01 0.50 ± 0.00 0.44 ± 0.00 0.33 ± 0.01 0.48 ± 0.00
PaLM2-XXS 0.44 ± 0.01 0.50 ± 0.00 0.45 ± 0.00 0.50 ± 0.01 0.48 ± 0.00 0.45 ± 0.00 0.34 ± 0.00 0.49 ± 0.00
PaLM2-XS 0.45 ± 0.01 0.50 ± 0.01 0.46 ± 0.01 0.49 ± 0.00 0.48 ± 0.00 0.44 ± 0.00 0.33 ± 0.01 0.49 ± 0.00
PaLM2-S 0.45 ± 0.00 0.49 ± 0.00 0.47 ± 0.00 0.50 ± 0.01 0.50 ± 0.01 0.44 ± 0.00 0.31 ± 0.00 0.49 ± 0.00
PaLM2-M 0.45 ± 0.01 0.49 ± 0.01 0.48 ± 0.01 0.50 ± 0.01 0.50 ± 0.00 0.44 ± 0.00 0.29 ± 0.00 0.49 ± 0.01
PaLM2-L 0.46 ± 0.01 0.49 ± 0.00 0.49 ± 0.00 0.50 ± 0.00 0.52 ± 0.00 0.44 ± 0.00 0.28 ± 0.00 0.49 ± 0.00

Table 8. Joseph exponent J evaluated on the first 2048 tokens, after trimming the first 100 tokens, of documents belonging to each of the
shown domains. Only documents with a minimum length of 4K tokens are used.

C. Predicting Downstream Performance


Table 9 presents detailed downstream performance results, along with corresponding upstream metrics.
In Table 10, we repeat the same analysis in Section 3 using the adjusted R2 coefficient, but with the self-similarity S and
Joseph exponents J. Unlike in the median Hurst exponent, we do not observe any improvement when combining perplexity
scores with the self-similarity exponent S or the Joseph exponent J.


Model   BPB   0S BBH Direct   0S BBH CoT   0S MMLU   3S BBH Direct   3S BBH CoT   5S MMLU   8S GSM8K CoT   0S BBH+MMLU   FS BBH+MMLU+GSM8K
T5-Decoder-110M 1.11 0.83 0.11 25.65 21.36 5.69 25.62 0.91 13.06 13.35
T5-Decoder-340M 1.00 0.96 0.17 25.72 23.57 10.03 25.98 1.59 13.14 14.79
T5-Decoder-1B 0.92 1.29 0.14 25.99 24.26 13.19 24.82 1.14 13.35 14.90
T5-Decoder-5B 0.85 2.13 0.48 24.41 24.76 18.05 25.63 2.20 12.86 16.41
PaLM1-8B 0.78 6.46 1.21 23.53 32.18 27.60 24.56 5.16 13.68 19.87
PaLM1-62B 0.70 13.79 0.83 51.86 39.51 39.70 54.78 29.57 29.59 41.32
PaLM1-540B 0.66 23.26 4.72 67.78 52.44 56.02 70.50 56.79 40.89 60.51
PaLM2-XXS 0.81 8.99 0.13 25.26 30.71 26.08 24.72 2.96 14.91 18.69
PaLM2-XS 0.73 16.68 0.95 49.69 38.28 37.64 47.42 22.14 29.25 35.84
PaLM2-S 0.67 23.60 4.24 69.89 48.88 50.88 68.12 50.49 41.91 56.16
PaLM2-M 0.65 21.32 5.70 69.62 52.49 56.04 69.33 59.21 41.57 60.94
PaLM2-L 0.61 24.00 10.19 79.10 66.34 66.66 78.64 80.36 48.10 75.17

Table 9. Full downstream few-shot evaluation results compared to upstream BPB. Here, BPB is computed over The Pile validation split
using the first 2048 tokens of every document. All evaluation results are reported as raw (un-normalized) accuracy.

Please note that our results are not directly comparable to all previously published results for the same models; please cite the
original results from (Chowdhery et al., 2022; Anil et al., 2023b). Here, we only aim for a fair comparison between models: only
pretrained models without instruction tuning are used, we do not optimize any prompts for each model, and we evaluate all models using
only a 2K sequence length.

BPB S J BPB+S BPB+J


0S BBH Direct 0.785 -0.060 0.673 0.761 0.794
0S MMLU 0.653 -0.067 0.426 0.614 0.614
0S BBH+MMLU 0.685 -0.065 0.472 0.650 0.651
3S BBH Direct 0.767 -0.030 0.599 0.744 0.754
3S BBH CoT 0.881 -0.026 0.678 0.870 0.879
5S MMLU 0.660 -0.044 0.421 0.624 0.622
8S GSM8K CoT 0.654 -0.037 0.427 0.619 0.616
FS BBH + MMLU+GSM8K 0.717 -0.036 0.489 0.687 0.686

Table 10. Adjusted R2 , which measures the proportion of variation in downstream performance (row) that is predictable from the given
input(s) (column) using a trained linear regressor. Unlike in the median Hurst exponent, we do not observe any improvement when
combining BPB scores with the self-similarity exponent S or the Joseph exponent J.
