The Small-World of Human Language

Ramon Ferrer i Cancho∗ and Ricard V. Solé∗,+



∗ Complex Systems Research Group, FEN-UPC, Campus Nord B4-B5, 08034 Barcelona, Spain
+ Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, New Mexico 87501, USA

Words in human language interact within sentences in non-random ways, allowing humans to construct an astronomic variety of sentences from a limited number of discrete units. This construction process is extremely fast and robust. The cooccurrence of words within sentences reflects language organization in a subtle manner that can be described in terms of a graph of word interactions. Here we show that such a graph displays two important features recently found in a disparate number of complex systems: (a) the so-called small world effect; in particular, the average distance between two words, d (i.e. the average minimum number of jumps to be made from an arbitrary word to another), is shown to be d ≈ 2-3, despite the fact that the human brain can store many thousands of words; and (b) a scale-free distribution of degrees. The known dramatic effects of disconnecting the most connected vertices in such networks can be identified in some language disorders. These observations suggest some unexpected features of language organization that might reflect the evolutionary and social history of lexicons and the origins of their flexibility and combinatorial nature.

Keywords: Small-world, Scaling, Lexical networks, Human language

I. INTRODUCTION

The emergence of human language is one of the major transitions in evolution (Smith & Szäthmáry, 1997). Living humans possess a unique symbolic mind capable of language that is not shared by any other species. Over two million years of hominid evolution, a coevolutionary exchange between languages and brains took place (Deacon, 1997). This process involved the (possibly sudden) transition from non-syntactic to syntactic communication (Nowak & Krakauer, 1999; Nowak et al., 2000). Human language allows the construction of a virtually infinite range of combinations from a limited set of basic units. The process of sentence generation is astonishingly rapid and robust and indicates that we are able to rapidly gather words to form sentences in a highly reliable fashion.

A complete theory of language requires a theoretical understanding of its implicit statistical regularities. The best known of them is Zipf's law, which states that the frequency of words decays as a power function of their rank (Zipf, 1972). However, in spite of its relevance and universality (Balasubrahmanyan & Naranan, 1996), such a law can be obtained from a variety of mechanisms (Nicolis, 1991; Simon, 1955; Li, 1992) and does not provide deep insight into the organization of language. The reason is that information transmission is organized into sentences, made of words in interaction.

Human brains store lexicons usually formed by thousands of words. Estimates are in the range 10^4 - 10^5 words (Romaine, 1992; Miller & Gildea, 1987). Besides, the contents of the lexicon of individuals of the same language vary depending on many factors such as age, geographic location, social context, education and profession.

FIG. 1. A possible pattern of wiring in ΩL. Black nodes are common words and white nodes are rare words. Two words are linked if they cooccur significantly.

Since the primary goal of a lexicon is to achieve successful communication, there must exist a common lexicon for successful basic communication between speakers, hereafter named a kernel lexicon, to surmount the limitations imposed by the factors mentioned above. Obviously, the best candidates to form this lexicon are the most frequent words. Actually, the analysis of multi-speaker corpora shows two different regimes dividing words into basic and specialized words (Ferrer & Solé, 2000).
Words interact in many ways. Some words cooccur with certain words with higher probability than with others, and cooccurrence is not of a trivial nature, i.e. it is not a straightforward implication of the known frequency distribution of words. If a text is scrambled, the frequency distribution is maintained but its content will not make sense.

In this paper we show that the cooccurrence of words in sentences relies on the network structure of the lexicon, whose properties are analyzed in depth. As we will show, human language can be described in terms of a graph of word interactions. This graph has some unexpected properties (shared by other biological and technological networks (Strogatz, 2001)) that might underlie its diversity and flexibility and open new questions about its origins and organization.

II. GRAPH PROPERTIES OF HUMAN LANGUAGE

Words cooccur in sentences. Many cooccurrences are due to syntactic relationships between words, e.g. head-modifier or dependency relationships (Melčuck, 1989). Some others are due to stereotyped expressions or collocations (e.g. take it easy, New York). We will define links as significant cooccurrences between words. We do not seek to provide a detailed list of the origins and linguistic interpretations of such significant cooccurrences, but rather to show that they exist and can be captured using quantitative measures of correlation regardless of their nature. A first approach for estimating the network of the lexicon is to consider that there is a link between every pair of neighbouring words (at the risk of capturing spurious correlations).

Let us consider the graph of human language, ΩL, defined by ΩL = (WL, EL), where WL = {wi} (i = 1, ..., NL) is the set of NL words and EL = {{wi, wj}} is the set of edges/connections between words. Here ξij = {wi, wj} indicates that there is an edge (and thus a link) between words wi and wj. Two connected words are adjacent and the degree of a given word is the number of edges that connect it with other words. Figure 1 shows what such a network might look like.

Recent research on a number of biological, social and technological graphs has revealed that they share a common feature: the so-called small world (SW) property (Watts & Strogatz, 1998; Watts, 1999; Newman, 2000). Small world graphs have a number of surprising properties that make them especially relevant for understanding how interactions among individuals, metabolites or species lead to the robustness and homeostasis observed in nature (Watts & Strogatz, 1998). The SW pattern can be detected from the analysis of two basic statistical properties: the so-called clustering coefficient Cv and the path length d. Let us consider the set of links ξij (i, j = 1, ..., NL), where ξij = 1 if a link exists and zero otherwise, and let the average number of links per word be k̄. Let us indicate by Γi = {wj | ξij = 1} the set of nearest neighbors of a word wi ∈ WL. The clustering coefficient for this word is defined from the number of connections between the words wj ∈ Γi. By defining

    L_i = \sum_{j=1}^{N_L} \xi_{ij} \left( \sum_{k \in \Gamma_i,\, j < k} \xi_{jk} \right)        (1)

we have

    c_v(i) = \frac{L_i}{\binom{|\Gamma_i|}{2}}

so that the clustering coefficient is the average over WL:

    C_v = \frac{1}{N_L} \sum_{i=1}^{N_L} c_v(i)        (2)

and measures the average fraction of pairs of neighbors of a node that are also neighbors of each other.

The second measure is easily defined. Given two words wi, wj ∈ WL, let dmin(i, j) be the minimum path length connecting these two words in ΩL. The average path length of a word will be defined as

    d_v(i) = \frac{1}{N_L} \sum_{j=1}^{N_L} d_{\min}(i, j)        (3)

and thus the average path length d will be

    d = \frac{1}{N_L} \sum_{i=1}^{N_L} d_v(i)        (4)

Graphs with small world structure are highly clustered, yet d is small. Random graphs (where nodes are randomly wired) are not clustered and also have short d (Watts, 1999). At the other extreme, regular lattices with only nearest-neighbor connections among units are typically clustered and exhibit long paths. It has been shown, however, that a regular lattice can be transformed into a SW if a small fraction of nodes are rewired to randomly chosen nodes. Thus a small degree of disorder generates short paths (as in the random case) while retaining the regular pattern (Watts & Strogatz, 1998).

For random graphs, C_v^rand ≈ k̄/N. For SW graphs, d is close to the value expected from random graphs, d_rand, with the same k̄, and Cv ≫ C_v^rand. These two conditions are taken as the standard definition of SW. SW graphs have been shown to be present in both social and biological networks (Jeong et al., 2000; Montoya & Solé, 2000; Solé & Montoya, 2000; Newman, 2000; Strogatz, 2001). Besides, some of these networks also exhibit scaling in their degree distribution. In other words, the probability P(k) of having a node with degree k scales as P(k) ≈ k^−γ. We have found that the graph of human language displays similar properties. This second property has been shown to be related to an extremely high stability against perturbations directed at randomly chosen nodes and a high fragility when perturbations are directed at highly connected ones (Albert et al., 2000). As we will show here, ΩL exhibits both SW structure and power laws in P(k).
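To make the definitions above concrete, the following minimal Python sketch (illustrative only, not code from the study) computes c_v(i), C_v and d for a small invented cooccurrence graph; the breadth-first search assumes the graph is connected.

```python
from collections import deque
from itertools import combinations

def clustering(adj, i):
    """c_v(i): fraction of pairs of neighbours of word i that are themselves linked (Eqs. 1-2)."""
    neigh = adj[i]
    if len(neigh) < 2:
        return 0.0
    linked = sum(1 for u, v in combinations(neigh, 2) if v in adj[u])
    return linked / (len(neigh) * (len(neigh) - 1) / 2)

def mean_distance(adj, i):
    """d_v(i): average minimum number of jumps from word i to every other word (Eq. 3), via BFS."""
    dist, queue = {i: 0}, deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sum(d for w, d in dist.items() if w != i) / (len(dist) - 1)

# toy undirected cooccurrence graph (invented edges, for illustration only)
edges = [("the", "dog"), ("the", "cat"), ("dog", "barks"), ("cat", "sleeps"),
         ("the", "barks"), ("the", "sleeps"), ("dog", "cat")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

N = len(adj)
C_v = sum(clustering(adj, w) for w in adj) / N       # Eq. (2)
d = sum(mean_distance(adj, w) for w in adj) / N      # Eq. (4)
k_mean = sum(len(v) for v in adj.values()) / N
print(f"C_v = {C_v:.3f}, d = {d:.3f}, random expectation C_v ~ {k_mean / N:.3f}")
```

With the figures reported below for the word networks (k̄ ≈ 70 and N ≈ 4.6 · 10^5), the random-graph expectation is C_v^rand ≈ k̄/N ≈ 1.5 · 10^−4, the value quoted in Table I.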
2000; Solé & Montoya, 2000; Newman, 2000; Strogatz, • We are not interested in all the relations happening
2001). Besides, some of these networks also exhibit scal- in a particular sentence. Our goal is to capture as
ing in their degree distribution. In other words, the prob- much links as possible through an automatic proce-
ability P (k) of having a node with degree k scales as dure. If the corpus is big enough, the macroscopic
P (k) ≈ k −γ . We have found that the graph of human properties of the network should emerge.
language displays similar properties. This second prop-
erty has been shown to be related with an extremely high • Being syntactic dependencies non-crossing (Hud-
stability against perturbations directed to randomly cho- son, 1984; Melčuck, 1989), a long distance syntactic
sen nodes and a high fragility when perturbations are di- link implies the existence of lower distance syntac-
rected to highly connected ones (Albert et al., 2000). As tic links. In contrast, a short distance link do not
we will show here, ΩL exhibits both SW structure and a imply a long-distance link.
power laws in P (k).
The technique can be improved by choosing only the
pairs of consecutive words whose mutual cooccurrence is
III. LINK COLLECTION larger than the one expected from their chance. This
can be measured with the condition pij > pi pj which
The most correlated words in a sentence are the closest. defines the presence of real correlations beyond the ex-
A decision must be taken about the maximum distance pected from a random ordering of words. If a pair of
considered for forming links. If the distance is long, the words cooccurs less times than what it would be expected
risk of capturing spurious coocurrences increases. If the when independence between such words is assumed, the
distance is too short, certain strong coocurrences can be pair is considered to be spurious. Graphs in which this
sistematically not taken into account. We decided the condition is used will be called restricted (unrestricted
maximum distance according to the minimum distance otherwise).
at which most of the coocurrences are likely to happen:
• Many coocurrences take place at distance 1, IV. SCALING AND SW PATTERNS
e.g. red flowers (adjective-noun), the/this house
(article/determiner-noun), stay here (verb-adverb), The networks resulting from the basic and improved
getting dark (verb-adjective), can see (modal-verb), methods will be called, respectively, the unrestricted
... word network (UWN) and the restricted word net-
• Many coocurrences take place at distance 2, e.g. hit work (RWN). They have N (U W N ) = 478, 773 and
the ball (verb object), Mary usually cries (subject- N (RW N ) = 460, 902 nodes with E(U W N ) = 1.77 · 107
verb), table of wood (noun-noun through a prepo- and E(RW N ) = 1.61 · 107 edges, respectively. With av-
sitional phrase), live in Boston (verb-noun), . . . erage connectivities of k̄uwn = 74, 2 and k̄rwn = 70, 13,
their clustering and path lengths are indicated in Table
Long distance correlations, i.e. at distance greater 1.
than two, have been shown to take place in human sen- Figure 2 shows the distribution of degrees of both the
tences (Chomsky, 1957). Here we stop our seek at dis- UWN and RWN obtained after processing about 3/4 of
tance two. The reason is fourfold: the million words of the British National Corpus (about
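As a rough illustration of this procedure (not the authors' actual pipeline), the sketch below collects cooccurrences at distances 1 and 2 from a toy token sequence and then applies the pij > pi pj filter. The way the probabilities are normalized here (pair counts over the total number of extracted pairs, word counts over the total number of tokens) is our own assumption, since the text does not spell it out.

```python
from collections import Counter

def collect_pairs(tokens, max_dist=2):
    """Count unordered word pairs that cooccur at distance 1 or 2 (unrestricted links)."""
    pairs = Counter()
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + max_dist + 1, len(tokens))):
            if w != tokens[j]:
                pairs[frozenset((w, tokens[j]))] += 1
    return pairs

def restricted_links(pairs, tokens):
    """Keep a pair only if p_ij > p_i * p_j, i.e. it cooccurs more often than chance predicts."""
    words, n_pairs, n_tokens = Counter(tokens), sum(pairs.values()), len(tokens)
    kept = set()
    for pair, c in pairs.items():
        wi, wj = tuple(pair)
        if c / n_pairs > (words[wi] / n_tokens) * (words[wj] / n_tokens):
            kept.add(pair)
    return kept

tokens = "the dog barks at the black cat and the cat sleeps".split()
pairs = collect_pairs(tokens)
print(len(pairs), "unrestricted links,", len(restricted_links(pairs, tokens)), "restricted links")
```

On a real corpus the same counting would be done sentence by sentence, so that links never cross sentence boundaries.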
IV. SCALING AND SW PATTERNS

The networks resulting from the basic and improved methods will be called, respectively, the unrestricted word network (UWN) and the restricted word network (RWN). They have N(UWN) = 478,773 and N(RWN) = 460,902 nodes, with E(UWN) = 1.77 · 10^7 and E(RWN) = 1.61 · 10^7 edges, respectively. With average connectivities of k̄_UWN = 74.2 and k̄_RWN = 70.13, their clustering coefficients and path lengths are indicated in Table I.

Figure 2 shows the distribution of degrees of both the UWN and the RWN, obtained after processing about three quarters of the British National Corpus (about 70 million words). The obvious limitations of our methods are overcome by the use of a large amount of data. The distributions of connectivities of the UWN and RWN decay with two different average exponents, γ1 = −1.50 for the first regime and γ2 = −2.70 for the second. The exponent in the second regime is similar to that of the so-called Barabási-Albert (BA) model (γBA = −3) (Barabási & Albert, 1999). The BA model is an independent rediscovery of earlier work by Herbert Simon on systems with skewed distributions (Simon, 1955). Using the rule of preferential attachment, they showed that scale-free distributions are obtained. The rule simply assumes that new nodes in the growing network are preferentially attached to an existing node with a probability proportional to the degree of that node.
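The preferential-attachment rule can be illustrated with a short simulation. This is only a generic sketch of the mechanism; the network size and the number m = 2 of links per new node are arbitrary choices for the example, not parameters fitted in the paper.

```python
import random
from collections import Counter

def preferential_attachment(n_nodes, m=2, seed=1):
    """Grow a BA-style network: each new node links to m existing nodes chosen
    with probability proportional to their current degree."""
    rng = random.Random(seed)
    targets, degree_biased, edges = list(range(m)), [], []
    for new in range(m, n_nodes):
        for t in set(targets):
            edges.append((new, t))
        degree_biased.extend(targets)       # every edge end appears once, so
        degree_biased.extend([new] * m)     # sampling from this list is degree-biased
        targets = [rng.choice(degree_biased) for _ in range(m)]
    return edges

deg = Counter()
for a, b in preferential_attachment(20000):
    deg[a] += 1
    deg[b] += 1
hist = Counter(deg.values())
for k in (2, 4, 8, 16, 32):
    print(k, hist.get(k, 0) / len(deg))     # P(k) should fall off roughly as k**-3
```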
Furthermore, word networks have small-world features. The average minimum distance between vertices is below 3 (2.63 for the UWN and 2.67 for the RWN), so reaching whatever vertex involves less than three jumps on average. This is remarkable, since the network contains about 4.7 · 10^5 different words. Clustering (0.687 for the UWN and 0.437 for the RWN) is far from the randomness expectation (1.55 · 10^−4 for both the UWN and the RWN) in both cases.

As far as we know, this is the first time that such a statistically significant property has been reported about the organization of human language. In spite of the huge number of words that can be stored by a given human, whatever word in the lexicon can be reached with less than three intermediate words on average. If a word is reached during communication, jumping to another word requires very few steps. Speed during speech production is important and can be more easily achieved if intervening words are close to each other in the underlying structure used for the construction of sentences. On the other hand, richness is another quality of powerful communication. Although words are preferentially chosen from the kernel lexicon, external words are at a short distance, so rich communication based on the word network can be attained with little increase in effort.

FIG. 2. Connectivity distribution for the unrestricted word network (circles) and the restricted word network (squares). Points are grouped by powers of two. Inset: average degree as a function of frequency. Degree increases as a function of frequency with exponent 0.80 for the first domain and 0.66 for the second one.

It is well known that the more frequent a word, the more available it is for production (Brown & McNeil, 1966; Kempen & Huijbers, 1983) and comprehension (Forster & Chambers, 1973; Scarborough et al., 1977) processes. This phenomenon is known as the frequency (referring to the whole of an individual's experience) or recency (referring to the recent individual experience) effect (Akmajian et al., 1995). This phenomenon will serve us to show that preferential attachment is very likely to be shaping the scale-free distribution of degrees in a way similar to the BA model. For the most frequent words,

    k \propto f^{0.66}

where k is the degree and f is the frequency of the word. We can then recast the frequency effect in terms of the degree: the higher the degree of a word, the higher its availability. In other words, links including highly connected words are preferred. The inset in Figure 2 shows the complete relationship between f and k in the RWN.

FIG. 3. Connectivity distribution for the kernel word network, formed by the 5000 most connected vertices in the RWN. Inset: power-law tail for k > k̄, calculated by grouping in powers of 1.5, 1.75 and 2. The exponent of the power tail is γKWN ≈ −3, suggesting that preferential attachment is at play.

The exponent of the UWN and RWN is closer to γBA = −3 in the second regime of the distribution, which is where the frequency effect makes much more sense. The kernel lexicon contains words common to the whole community of speakers. Beyond the kernel, a certain word may be unknown to one speaker and familiar to another. The recency effect then cannot be applied to all the individuals contributing to shape the underlying lexicon network. It is thus expected that the network formed exclusively by the interaction of kernel words, hereafter referred to as the kernel word network (KWN), better agrees with the predictions that can be made when preferential attachment is at play. Figure 3 shows the log-normal appearance of the connectivity distribution. The power tail has exponent γKWN ≈ −3, consistent with the Barabási-Albert model (Barabási & Albert, 1999), and the differences with respect to it require special attention. It is important to notice that the kernel lexicon is a versatile subset of the repertoire of individual speakers. A few thousand words must be able to say everything, or almost everything. Even when lexicons become very small, as in pidgin languages, whose lexicons do not usually exceed about 1,000 words (Romaine, 1992), it has been pointed out that they allow one to say everything that can be said with a complex lexicon (e.g. English), at the expense of high redundancy (resorting to circumlocution). The average connectivity in the kernel is k̄ = 1219. A first consequence is that words with low connectivity must be rare. Having rather useless words in this critical subset is an enormous waste. Once connected words become frequent in the distribution, the network organizes itself in a scale-free way. We believe that this scale-freeness is responsible for the ability-to-say-everything of the kernel. A non-trivial network is needed, since every word on average is connected to 24% of the rest of the kernel words.
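Curves like those in Figs. 2 and 3 are obtained by grouping degrees into logarithmic bins (powers of two in Fig. 2; powers of 1.5, 1.75 and 2 in the inset of Fig. 3). A possible way of doing this, sketched here with an invented list of degrees standing in for the real data, is:

```python
import math
from collections import Counter

def log_binned_pk(degrees, base=2.0):
    """Estimate P(k) by grouping degrees into bins whose widths grow as powers of `base`."""
    counts, n = Counter(degrees), len(degrees)
    edges, k = [], 1.0
    while k <= max(counts) * base:
        edges.append(k)
        k *= base
    binned = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mass = sum(c for deg, c in counts.items() if lo <= deg < hi)
        if mass:
            # divide by the bin width so that wide bins are not over-weighted
            binned.append((math.sqrt(lo * hi), mass / (n * (hi - lo))))
    return binned

degrees = [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 5, 7, 9, 14, 21, 40]   # placeholder data
for k_mid, p in log_binned_pk(degrees):
    print(f"k ~ {k_mid:6.1f}   P(k) ~ {p:.4f}")
```

The slope of log P(k) against log k within each regime then gives the exponents γ1, γ2 and γKWN quoted above.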
V. DISCUSSION

We have shown that the graph connecting words in language exhibits the same statistical features as other complex networks. The short distance between words arising from the SW structure strongly suggests that language evolution might have involved the selection of a particular graph of connections between words. Future work should address this problem theoretically, perhaps using an evolutionary language game model (Nowak & Krakauer, 1999; Nowak et al., 2000) in which a pay-off associated with the graph structure is introduced. Concerning the scaling in P(k) and the observed exponents, this pattern also calls for an evolutionary explanation. The word network is the result of a growth process in which new words are added and are likely to be linked to highly connected existing words.

If the small-world features derive from optimal navigation needs, two predictions can be formulated. First, the existence of words whose main purpose is to speed up navigation. Second, deriving from the first, the existence of brain disorders characterized by navigation deficits in which such words are involved. The best candidates for answering the first question are the so-called particles, a subset of the function words (e.g. articles, prepositions, conjunctions) formed by the most frequent among them (e.g. and, the, of, ...)¹. These words are characterized by very low or zero semantic content. Although they are supposed to contribute to the sentence structure, they are not generally crucial for sentence understanding. A compelling test of this statement is that particles are the first words to be suppressed in telegraphic speech (Akmajian et al., 1995).

¹ According to our calculations, the 10 most connected words are and, the, of, in, a, to, 's, with, by and is.

The answer to the second prediction is agrammatism, a kind of aphasia in which speech is nonfluent, labored, halting and lacking function words (and thus particles). Agrammatism is the only syndrome in which function words are particularly omitted (Caplan, 1987). Function words are the most connected ones. We suggest that such halts and lack of fluency are due to the fragility associated with the removal of highly connected words. Although scale-free networks are very tolerant to the random removal of vertices, if deletion is directed at the most connected vertices the network gets broken into pieces (Albert et al., 2000).

It is known that the omission of function words is often accompanied by substitutions of such words. Patients in whom substitutions predominate and speech is fluent are said to undergo paragrammatism (Caplan, 1994). We suggest that paragrammatism recovers fluency (i.e. a low average word-word distance) by inappropriately using the remaining highly connected vertices, thus often producing substitutions of words during discourse.
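The fragility argument invoked in the two previous paragraphs can be made concrete with a small sketch comparing the largest connected component that survives deletion of the most connected vertices with the one that survives deletion of randomly chosen vertices. The hub-and-leaves toy graph below is invented and only mimics the qualitative behaviour reported by Albert et al. (2000).

```python
import random
from collections import deque

def largest_component_fraction(adj, removed):
    """Fraction of all vertices contained in the largest component once `removed` are deleted."""
    alive, seen, best = set(adj) - removed, set(), 0
    for start in alive:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best / len(adj)

def attack(adj, n_removed, targeted, seed=0):
    """Pick vertices to delete: the most connected ones (targeted) or random ones."""
    order = sorted(adj, key=lambda w: len(adj[w]), reverse=True) if targeted \
        else random.Random(seed).sample(sorted(adj), len(adj))
    return set(order[:n_removed])

# toy graph: one hub linked to 29 leaves, plus a few leaf-leaf links
adj = {f"w{i}": set() for i in range(30)}
for i in range(1, 30):
    adj["w0"].add(f"w{i}"); adj[f"w{i}"].add("w0")
for i in range(1, 28, 3):
    adj[f"w{i}"].add(f"w{i+1}"); adj[f"w{i+1}"].add(f"w{i}")

print(largest_component_fraction(adj, attack(adj, 1, targeted=True)))   # most connected vertex deleted
print(largest_component_fraction(adj, attack(adj, 1, targeted=False)))  # one arbitrary vertex deleted
```

In this toy network, deleting the hub shatters the graph into small pieces, whereas deleting almost any other single vertex leaves the largest component essentially intact, which mirrors the tolerance/fragility asymmetry described above.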
ACKNOWLEDGMENTS

The authors want to specially thank G. Miller and F. Diéguez for valuable discussions and are also grateful to D. Krakauer, A. Lloyd, M. Nowak, F. Reina and J. Roselló for helpful comments. RFC acknowledges the hospitality of the Institute for Advanced Study. This work was supported by the Santa Fe Institute (RVS) and grants of the Generalitat de Catalunya (FI/2000-00393, RFC) and the CICYT (PB97-0693, RVS).

References

Akmajian, A., Demers, R. A., Farmer, A. K., & Harnish, R. M. (1995). Linguistics: An introduction to language and communication. MIT Press. (Chapter 2)

Albert, R., Jeong, H., & Barabási, A.-L. (2000). Error and attack tolerance of complex networks. Nature, 406, 378-381.

Balasubrahmanyan, V. K., & Naranan, S. (1996). Quantitative linguistics and complex system studies. Journal of Quantitative Linguistics, 3(3), 177-228.

Barabási, A.-L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286, 509-511.

Brown, R., & McNeil, D. (1966). The "tip of the tongue" phenomenon. Journal of Verbal Learning and Verbal Behaviour, 5, 325-337.

Caplan, D. (1987). Neurolinguistics and linguistic aphasiology. Cambridge University Press.

Caplan, D. (1994). Language: Structure, processing and disorders. The MIT Press.

Chomsky, N. (1957). Syntactic structures. Mouton.

Choueka, Y., & Lusignan, S. (1985). Disambiguation by short contexts. Computers and the Humanities, 19, 147-157.

Deacon, T. W. (1997). The symbolic species: The co-evolution of language and the brain. W. W. Norton & Company.

Ferrer, R., & Solé, R. V. (2000). Two regimes in the frequency of words and the origin of complex lexicons. Santa Fe Working Paper 00-12-068. (Submitted to the Journal of Quantitative Linguistics)

Forster, K. I., & Chambers, S. M. (1973). Lexical access and naming time. Journal of Verbal Learning and Verbal Behaviour, 12, 627-635.

Hudson, R. (1984). Word grammar. B. Blackwell.

Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N., & Barabási, A.-L. (2000). The large-scale organization of metabolic networks. Nature, 407, 651-654.

Kaplan, A. (1955). An experimental study of ambiguity and context. Mechanical Translation, 2, 39-46.

Kempen, G., & Huijbers, P. (1983). The lexicalization process in sentence production and naming: Indirect election of words. Cognition, 14, 185-209.

Li, W. (1992). Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6).

Melčuck, I. (1989). Dependency grammar: Theory and practice. New York: University of New York.

Miller, G. A., & Gildea, P. M. (1987). How children learn words. Scientific American, 257(3).

Montoya, J. M., & Solé, R. V. (2000). Small world patterns in food webs. Santa Fe Working Paper 00-10-059. (Submitted to J. Theor. Biol.)

Newman, M. E. J. (2000). Models of the small world. Journal of Statistical Physics, 101(3/4), 819-841.

Nicolis, J. S. (1991). Chaos and information processing. Singapore: World Scientific.

Nowak, M. A., & Krakauer, D. C. (1999). The evolution of language. Proc. Natl. Acad. Sci. USA, 96, 8028-8033.

Nowak, M. A., Plotkin, J. B., & Jansen, V. A. (2000). The evolution of syntactic communication. Nature, 404, 495-498.

Romaine, S. (1992). The evolution of linguistic complexity in pidgin and creole languages. In J. A. Hawkins & M. Gell-Mann (Eds.), The evolution of human languages (pp. 213-238). Addison Wesley.

Scarborough, D. L., Cortese, C., & Scarborough, H. S. (1977). Frequency and repetition effects in lexical memory. Journal of Experimental Psychology: Human Perception and Performance, 3(1), 1-17.

Simon, H. A. (1955). On a class of skew distribution functions. Biometrika, 42, 425-440.

Smith, J. M., & Szäthmáry, E. (1997). The major transitions in evolution. Oxford University Press.

Solé, R. V., & Montoya, J. M. (2000). Complexity and fragility in ecological networks. Santa Fe Working Paper 00-11-060.

Strogatz, S. H. (2001). Exploring complex networks. Nature, 410, 268-276.

Watts, D. J. (1999). Small worlds. Princeton University Press.

Watts, D. J., & Strogatz, S. H. (1998). Collective dynamics of 'small-world' networks. Nature, 393, 440-442.

Zipf, G. K. (1972). Human behaviour and the principle of least effort: An introduction to human ecology. New York: Hafner reprint. (1st edition: Cambridge, MA: Addison-Wesley, 1949)


             C        C_random        d       d_random
ΩL (UWN)   0.687    1.55 · 10^−4    2.63∗      3.03
ΩL (RWN)   0.437    1.55 · 10^−4    2.67∗      3.06

TABLE I. Word network patterns. It can be seen that C ≫ C_random and d ≈ d_random, consistent with a SW network. All values are exact except those marked with ∗, which have been estimated on a random subset of the vertices.
