Reprinted from Molecular Strategies in Biological Evolution
Volume 870 of the Annals of the New York Academy of Sciences
May 18, 1999
The Linguistics of DNA: Words, Sentences,
Grammar, Phonetics, and Semantics
SUNGCHUL JI
Department of Pharmacology and Toxicology, Rutgers University, Piscataway,
New Jersey 08855, USA
There are theoretical reasons to believe that biologic systems and processes cannot be
fully accounted for in terms of the principles and laws of physics and chemistry alone,
but that they require in addition the principles of semiotics, the science of symbols and
signs, including linguistics. For convenience, we may refer to the belief, common among
contemporary molecular biologists, that the laws of physics and chemistry are necessary and
sufficient to account for life as the PC (physics and chemistry) paradigm, and to the
alternative view, that principles of semiotics are additionally required for a complete
understanding of living systems and processes, as the PCS (physics, chemistry, and
semiotics) paradigm.
It was von Neumann who first recognized the necessity for symbolic self-representation
of organisms as a prerequisite for efficient self-replication. In view of the fundamental
importance of this insight for biology, we may refer to this notion as the von Neumann
doctrine. This doctrine was further elaborated and developed by Pattee into what may be
called the theory of matter-symbol complementarity. The linguistic theory of DNA presented
here can be viewed as a natural extension, to the structure and function of DNA, of
the von Neumann doctrine and Pattee's theory of matter-symbol complementarity. (See
Note Added in Proof.)
Since the discovery of the DNA double helix in 1953, many biologists have employed
language as a useful metaphor to describe certain aspects of molecular biologic phenomena.
But recently it was postulated that language is more than just a metaphor and that
linguistics provides a fundamental principle to account for the structure and function of
the cell. This conclusion is supported by the facts (1) that cells use a language, called cell
language or cellese, defined as "a self-organizing system of molecules, some of which
encode, act as signs for, or trigger, gene-directed cell processes," and (2) that cell language
has molecular counterparts to 10 of the 13 design features of human language (humanese)
characterized by Hockett and Lyon, thus suggesting an isomorphism between cellese and
humanese. Because cellese must be transmitted from one generation to the next, it
must be encoded in DNA. Therefore, the main objective of this communication is to
characterize the structure and function of DNA based on linguistic principles.
ISOMORPHISM BETWEEN CELL AND HUMAN LANGUAGES
Both human and cell languages can be treated as a 6-tuple (L, W, S, G, P, M), where L
is the alphabet (i.e., a set of basic symbols called protosemata), W is the vocabulary or
lexicon (i.e., a set of words), S is an arbitrary set of sentences, G is a set of the rules
governing the formation of sentences from words (the first articulation) as well as the
formation of words from letters (the second articulation), P is a set of physical mechanisms
realizing and implementing a language, and finally M is a set of objects (both symbolic
and material) or processes referred to by words and sentences. TABLE 1 summarizes a
comparison between sound-based and visual signal-based human language and molecule-based
cell language with respect to these categories of linguistic features. The table is
self-explanatory, and newly appearing terms are explained in the accompanying footnotes. The
isomorphism between cell and human languages evident in TABLE 1 suggests the existence
of three distinct categories of genetic information in DNA, here called the lexical,
syntactic, and semantic. To visualize the relation among these three categories of genetic
information, it is convenient to use a loaded carousel as a metaphor for DNA with the
alignment as shown in TABLE 2.

TABLE 1. Comparison between Human and Cell Languages

Feature                  Human Language                 Cell Language
-----------------------  -----------------------------  ------------------------------------------------
1. Alphabet (L)          Letters                        4 nucleotides (or 20 amino acids)
2. Lexicon (W)           Words                          Structural genes (or polypeptides)
3. Sentences (S)         Strings of words               Sets of genes expressed coordinately in space
                                                        and time under the control of spatiotemporal
                                                        genes (a)
4. Grammar (G)           Rules of sentence formation    Laws of chemistry and physics of nucleic acids
                                                        that determine the folding patterns of DNA
                                                        according to nucleotide sequences and
                                                        microenvironmental conditions. Only a small
                                                        subset of grammatically folded (hence
                                                        syntactically correct) chromatin structures is
                                                        selected by evolution and hence carries genetic
                                                        (i.e., semantic) information.
5. Phonetics (P)         Physiologic structures and     Conformational dynamics of DNA that enables the
                         processes underlying           expression of genetic information through input
                         phonation, audition, and       of free energy via protein binding and/or
                         interpretation                 ATP-dependent supercoiling of DNA
6. Semantics (M)         Meaning of words and           Gene-directed cell processes driven by
                         sentences                      conformons (b) and intracellular dissipative
                                                        structures (IDSs) (c)
7. First Articulation    Formation of sentences from    Organization of gene expression in space and
                         words                          time (through noncovalent interactions (d))
8. Second Articulation   Formation of words from        Organization of nucleotides (amino acids) into
                         letters                        genes (polypeptides) (through covalent
                                                        interactions (e))

(a) Genes that control the spatiotemporal evolution of the expression of structural genes by
regulating the time- and space-dependent folding patterns of chromosomes.
(b) Conformational strains of biopolymers that carry free energy (to do work) and information (to
control work).
(c) Dissipative structures of Prigogine (or attractors) localized within the cell.
(d) Molecular interactions that do not implicate any breaking or forming of covalent bonds.
(e) Molecular interactions that involve changes in covalent bonds, namely, alterations in valence
electronic configurations.
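
As an illustration only, the 6-tuple (L, W, S, G, P, M) defined above can be sketched as a small data structure. The class name, field names, the two articulation checks, and the toy nucleotide strings are assumptions introduced here for the example; they are not part of the cell language theory itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Language:
    """A language as the 6-tuple (L, W, S, G, P, M) described in the text."""

    alphabet: frozenset   # L: basic symbols
    lexicon: frozenset    # W: words built from the alphabet
    sentences: tuple      # S: each sentence is a tuple of words
    grammar: str          # G: description of the formation rules
    phonetics: str        # P: description of the physical mechanisms
    semantics: str        # M: description of what words and sentences refer to

    def second_articulation_ok(self) -> bool:
        """True if every word is spelled only with letters of the alphabet."""
        return all(set(word) <= self.alphabet for word in self.lexicon)

    def first_articulation_ok(self) -> bool:
        """True if every sentence is built only from words of the lexicon."""
        return all(set(sentence) <= self.lexicon for sentence in self.sentences)


# Toy "cellese" instance. The two nucleotide strings are made-up stand-ins
# for structural genes, not real sequences.
toy_cellese = Language(
    alphabet=frozenset("ACGT"),
    lexicon=frozenset({"ATGGCC", "ATGTTT"}),
    sentences=(("ATGGCC", "ATGTTT"),),
    grammar="physics and chemistry of nucleic acid folding",
    phonetics="conformational dynamics of DNA",
    semantics="gene-directed cell processes",
)
```

In this sketch G, P, and M are kept as free-text descriptions, because the text characterizes them qualitatively; only the two articulations (letters into words, words into sentences) are checked mechanically.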
Just as a grammar constrains mentally the word order in sentences, so a carousel
constrains physically the positioning of slides into a linear array, any linear array. The genetic
analog of this constraint is referred to as the syntactic genetic code, identified with the
physicochemical constraints of nucleic acids that control the folding patterns of chromatins
in response to microenvironmental conditions such as the presence of transcription
factors, pH, ions, and mechanical stresses of nuclear scaffolding. Please note that there are
a large number (i.e., n!) of ways of arranging n slides into n slots in a carousel. But only
one, or at most a few, of these linear arrays will actually be utilized by a given speaker. The
information needed (log2 n! bits) to select these few arrangements out of the large number
of possible arrangements derives from the brain of the speaker. But in the case of DNA, the
information determining the temporal order in which a set of genes is expressed must be
encoded in DNA itself (in the form of semantic genetic code), in regions that were previously
called spatiotemporal genes and postulated to be located in noncoding DNA. In the
human genome, structural genes account for approximately 3% of the total DNA mass,
whereas the remaining 97% of DNA is noncoding and was once thought to be without any
biologic function. But impressive amounts of empirical data were recently accumulated in
the literature, indicating that noncoding regions, particularly "repetitive sequences," play
an important role in genetic control processes. Consistent with these developments, it is
postulated here that these noncoding regions regulate the spatiotemporal evolution of the
expression of structural genes and thus contain genetic information analogous to the
semantic information of sentences. The genetic information that determines the
spatiotemporal organization of gene expression is referred to as "semantic genetic code." It is
thought that semantic genetic information is a subset of syntactic genetic information, just
as semantically meaningful sentences constitute but a small subset of grammatically
correct sentences in human language. The syntactic genetic information is distributed over the
whole DNA molecule in that every aspect of the physics and chemistry of DNA affects the
dynamics of DNA. Therefore, the DNA-mass percentages used by the three genetic codes
(3% lexical, 97% semantic, and 100% syntactic) sum to 200% (TABLE 2). This makes sense
only if we can assume that DNA structures encode more than one kind of information within
identical sequences and that different kinds of genetic information can overlap in DNA, in
agreement with the multiple genetic code hypothesis of Trifonov. The present result is also
consistent with the view that DNA possesses dual or complementary aspects: dynamic and
semiotic, or material and symbolic. The syntactic genetic code represents the dynamic or
material aspect of DNA obeying the laws of physics and chemistry, while lexical and semantic
genetic codes constitute the symbolic (or sign) aspect that obeys the rules forged by biologic
evolution. This interpretation fits nicely with the notion of the matter-symbol
complementarity (or more generally matter-sign complementarity; see Note Added in Proof) as the
most fundamental distinguishing feature of biology vis-à-vis physics and chemistry.

TABLE 2. The "Loaded Carousel" Model of DNA Structure and Function

Model             Representation                                    Genetic Code (% DNA mass utilized)
----------------  ------------------------------------------------  ----------------------------------
Slides            Structural genes in coding DNA                    Lexical genetic code (3%)
Carousel          Sugar-phosphate backbone, Watson-Crick base       Syntactic genetic code (100%)
                  pairing, chemistry and physics of DNA
Order of slides   Space- and time-dependent gene expression,        Semantic genetic code (97%)
                  made possible by space- and time-dependent
                  foldings of chromatins exposing right genes at
                  right times, all regulated by spatiotemporal
                  genes located in noncoding DNA
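
The counting argument in the carousel paragraph can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names and the dictionary are assumptions introduced here, and the percentages are simply the three values quoted in the text (3%, 97%, and 100%).

```python
import math

# Percent of DNA mass used by each genetic code, as quoted in the text.
CODE_SHARES = {"lexical": 3, "semantic": 97, "syntactic": 100}


def arrangements(n: int) -> int:
    """Number of linear orderings of n distinct slides: n!."""
    return math.factorial(n)


def selection_bits(n: int) -> float:
    """Bits needed to single out one ordering among n!: log2(n!)."""
    # lgamma(n + 1) equals ln(n!), which avoids building a huge integer.
    return math.lgamma(n + 1) / math.log(2)


def total_share(shares: dict) -> int:
    """Sum of the DNA-mass percentages over all genetic codes."""
    return sum(shares.values())
```

For a carousel with n = 4 slides, `arrangements(4)` gives 24 orderings and `selection_bits(4)` is about 4.58 bits, while `total_share(CODE_SHARES)` gives 200. The total exceeds 100 because the codes overlap on the same DNA.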
Indirect evidence for the existence of spatiotemporal genes (carrying semantic
genetic code) was recently provided by Amano et al. Their data from Figure 8 can be
replotted in a graph of the percentage of noncoding bases per genome versus the relative
amount of structural genes in the form of transcription factors per genome to obtain two
lines, one with a zero slope passing through five species of unicellular organisms
(Mycoplasma genitalium, Haemophilus influenzae, Methanococcus jannaschii, Synechocystis
sp., and Escherichia coli) and the other with a slope of about 20 passing through three
species (Saccharomyces cerevisiae, Caenorhabditis elegans, and Homo sapiens), two of
which are multicellular organisms. Interestingly, these lines intersect in the
neighborhood of E. coli and S. cerevisiae. Two conclusions may be drawn from this plot: (1) the
amount of noncoding DNA increases abruptly with the multicellularity of organisms
(most likely due to the fact that noncoding regions act as "spatiotemporal genes"
regulating the development of multicellular organisms), and (2) of the two mechanisms for
regulating gene expression, trans mechanisms mediated by transcription factors and cis
mechanisms mediated by noncoding regions, the latter contributes to a greater extent
(due to the slope being greater than 1) than the former as the complexity of multicellular
organisms increases.
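
The replotting described above amounts to fitting one straight line per group of species and locating where the two lines cross. The following is a minimal sketch of that procedure under stated assumptions: ordinary least squares is assumed as the fitting method (the text does not say how the lines were drawn), the function names are hypothetical, and no values from Amano et al. are included. Each point is taken as (relative amount of transcription-factor genes, percentage of noncoding bases).

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept.

    points: iterable of (x, y) pairs, here assumed to be
    (relative amount of transcription-factor genes, % noncoding bases).
    """
    points = list(points)
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def intersection(line_a, line_b):
    """Crossing point of two lines, each given as (slope, intercept)."""
    (m1, b1), (m2, b2) = line_a, line_b
    if m1 == m2:
        raise ValueError("parallel lines do not intersect")
    x = (b2 - b1) / (m1 - m2)
    return x, m1 * x + b1
```

Fitting the unicellular and the multicellular groups separately with `fit_line` and passing both results to `intersection` would give the crossing point that the text places near E. coli and S. cerevisiae.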
The role of noncoding DNA strongly suggested by these data is difficult to accommodate
within the traditional view that the final referents or meanings of genes are polypeptides.
However, the data are consistent with the so-called DNA:polypeptide-IDS
hypothesis, which claims that the final products of genes (i.e., structural genes under the
control of spatiotemporal genes) are not polypeptides but dynamic processes collectively
called intracellular dissipative structures (IDSs), whose generation is catalyzed by
enzymes encoded in structural genes. IDSs include ionic gradients in the cytosol or across
biomembranes, and mechanical stress gradients in biopolymers including cytoskeletons
and DNA, all of which together act as the proximal or immediate causes for cell
functions (see also FIG. 1).
According to some linguists, the phenomenon of double articulation or duality (see
seventh and eighth rows in TABLE 1) is the most fundamental aspect of all human
languages. The cell language theory is based on the basic assumption that the cell-linguistic
counterpart of double articulation is the duality of covalent and noncovalent
interactions in the cell. Just as the first and second articulations are both essential in human
language, so it is postulated that both covalent and conformational interactions are
fundamental in cell language (enabling intercellular communication and signal
transduction). This postulate appears to provide the first explicit rationale for the fundamental
role of conformational interactions in molecular biology, as observed in ligand-protein
interactions, protein foldings, and chromatin reorganizations during the cell
cycle.
FIGURE 1. The Bhopalator, a molecular model of the living cell. The cell is a molecular machine in
that its moving parts are made out of molecules, some of which act as molecular motors driven by
conformational strains (called conformons) generated from chemical reactions or ligand-binding
reactions. The final form of expression of genes is not polypeptides, as usually thought, but
dissipative structures of Prigogine (or attractors) (see the rectangle), namely gradients of chemical
concentrations and mechanical stresses in the cell. These dissipative structures act as the direct causes
for all cell functions. Solid arrows indicate the direction of information flow: from DNA to mRNA to
proteins to dissipative structures of Prigogine (also called intracellular dissipative structures, or
IDSs), and back to DNA. Dotted arrows indicate feedback interactions. The cell receives inputs from
its surroundings (Step 19), processes them according to the genetic information stored in DNA (Steps
5-11 and 1-4), and outputs the result (Step 20). Because IDSs can influence the rate of mutations
and recombinations of DNA (Step 10), DNA can guide its own evolution. That is, DNA is a
self-evolving molecule driven by conformons and IDSs. For more information, see reference 9,
particularly p. 178.
THE BHOPALATOR: A MOLECULAR MACHINE THAT ACCEPTS
CELL LANGUAGE
Since the founding of the cell theory in the mid-nineteenth century, there had been no
rigorous and comprehensive theoretic model of the living cell available in the literature
until 1983, when the Bhopalator model of the cell was proposed in a meeting held in
Bhopal, India. The name of the model reflects the convention that mechanisms of
self-organizing chemical reaction-diffusion systems are named as "X-ators," where X is the name of
a city connected in some way with the model. Two concepts are novel in this model: (1)
conformons, sequence-specific conformational strains of biopolymers that provide free
energy and control information for driving all molecular motors in the cell, and (2)
intracellular dissipative structures (IDSs), gradients of chemical concentrations and mechanical
stresses in the cell that mediate information transfer from the nucleus to the cytosol and
from the cytosol to the extracellular space. With conformons and IDSs (synonymous with
"attractors"), the cell can not only "read" genetic messages encoded in DNA but also
"implement" and "reify" these messages into molecular processes and actions constituting
cell functions. In other words, the Bhopalator can be viewed as a molecular machine that
accepts cell language encoded in DNA, just as the Turing machine acts as an abstract
machine that accepts and defines a formal language encoded on a tape. It is to be noted
that these molecular entities that drive the cell, namely conformons and IDSs, can be
viewed as the microscopic embodiments of the matter-symbol complementarity discussed
by Pattee.
PREDICTIONS
1. The cell language theory predicts that DNA of higher eukaryotes contains two kinds
of genes: structural genes located in coding regions (accounting for ~3% of the human
genomic mass) and spatiotemporal genes located in noncoding regions (~97% of the
human genomic mass).
2. Spatiotemporal genes encode the information controlling the timing of gene expression.
3. The timing information encoded in spatiotemporal genes is retrieved through space-
and time-dependent chromatin folding and unfolding processes driven by ATP-dependent
topoisomerases and free energy-releasing binding interactions between transcription
factors and DNA.
CONCLUSION
The cell language theory and the Bhopalator model of the living cell provide the first
comprehensive and rigorous theoretic framework for molecular and cell biology. As such,
they may find important applications in functional genomics in the coming decades.
[NOTE ADDED IN PROOF: It was recently suggested elsewhere (S. Ji, "The cell as the
smallest DNA-based molecular computer," BioSystems, in press) that the ideas of J. von
Neumann (i.e., the necessity of self-representation for self-reproduction) and H. Pattee
(namely, the matter-symbol complementarity as an essential feature of all self-reproducing
systems) be combined into what may be called the "von Neumann-Pattee principle of
matter-sign complementarity." The term "symbol" is replaced with the more general term
"sign," since according to C.S. Peirce (1839-1914), signs include symbols along with
icons and indexes (J.J. Liszka, "A General Introduction to the Semeiotic of Charles
Sanders Peirce," Indiana University Press, Bloomington, 1996). The essential content of the von
Neumann-Pattee principle of matter-sign complementarity is that all self-reproducing
systems embody two complementary aspects: the physical law-governed material/energetic
aspect and the evolutionary rule-governed sign aspect. The dual role of DNA revealed in
TABLE 2, namely the fact that [syntactic (100%)] + [lexical (3%) + semantic (97%)] =
200%, finds a rigorous theoretical rationale in the von Neumann-Pattee principle of the
matter (syntactic)-sign (lexical and semantic) complementarity.]
REFERENCES
1. PATTEE, H.H. 1968. The physical basis of coding and reliability in biological evolution. In
Towards a Theoretical Biology. 1. Prolegomena. C.H. Waddington, Ed.: 67-93. Aldine
Publishing Co. Chicago.
2. PATTEE, H.H. 1970. The problem of biological hierarchy. In Towards a Theoretical Biology. 3.
Drafts. C.H. Waddington, Ed.: 117-136. Aldine Publishing. Chicago.
3. PATTEE, H.H. 1972. Laws and constraints, symbols and languages. In Towards a Theoretical
Biology. 4. Essays. C.H. Waddington, Ed.: 248-258. Edinburgh University Press. Edinburgh.
4. VON NEUMANN, J. 1966. Theory of Self-Reproducing Automata. A.W. Burks, Ed.: 122-123.
University of Illinois Press. Urbana.
5. PATTEE, H.H. 1982. Cell psychology: An evolutionary approach to the symbol-matter
problem. Cog. Brain Theor. 5: 325-341.
6. PATTEE, H.H. 1995. Evolving self-reference: Matter, symbols, and semantic closure. Commun.
Cogn. Artif. Intell. (J. Integr. Study Artif. Intell. Cogn. Sci. Appl. Epistemol.) 12: 9-27.
7. SERENO, M.I. 1991. Four analogies between biological and cultural/linguistic evolution. J.
Theoret. Biol. 151: 467-507.
8. GARCIA-BELLIDO, A. 1984. Towards a genetic grammar. An English version of "Hacia una
Gramatica Genetica," Real Academia de Ciencias Exactas, Fisicas y Naturales.
9. JI, S. 1991. Biocybernetics: A machine theory of biology. In Molecular Theories of Cell Life
and Death. S. Ji, Ed.: 1-237. Rutgers University Press. New Brunswick.
10. JI, S. 1997. Isomorphism between cell and human languages: Molecular biological,
bioinformatic and linguistic implications. BioSystems 44: 17-39.
11. JI, S. 1998. Cell language (cellese): Implications for biology, linguistics and philosophy.
International Workshop on the Linguistics of Biology and the Biology of Language, CIFN,
Universidad Nacional Autónoma de México, Cuernavaca, México, March 23-27. For abstract,
see hp:fwwrcfn nam mComputationl_Biology998l
12. MARCUS, S. 1967. Algebraic Linguistics; Analytical Models. Academic Press. New York.
13. BRAHMACHARI, S.K., G. MEERA, P.S. SARKAR, P. BALAGURUMOORTHY, J. TRIPATHI, S. RAGHAVAN, U.
SHALIGRAM & S. PATASKAR. 1995. Simple repetitive sequences in the genome: Structure and
functional significance. Electrophoresis 16: 1705-1714.
14. TRIFONOV, E.N. 1989. The multiple codes of nucleotide sequences. Bull. Math. Biol. 51:
417-432.
15. AMANO, N., Y. OHFUKU & M. SUZUKI. 1997. Genomes and DNA conformation. Biol. Chem. 378:
1397-1404.