Prosodic Studies - Challenges and Prospects
Edited by Hongming Zhang and Youyong Qian
Contents
Introduction
Part I Prosodic hierarchy
Part II Prosodic patterns
5 A prosodic essence conjecture
Part III Interface between prosody and syntax/morphology
Part IV Prosody in language acquisition
Tables
6.10 A’ value for each group
6.11 ANOVA table of A’ value for each group
6.12 A t-test for accuracy rate and reaction time concerning glottalization (Allotone 7)
6.13 A t-test for accuracy rate and reaction time concerning glottalization (Allotone 8)
6.14 A t-test for accuracy rate and reaction time concerning duration (Allotone 7)
6.15 A t-test for accuracy rate and reaction time concerning duration (Allotone 8)
6.16 Models chosen for each tone and estimated coefficients
6.17 Quantiles based on fitted models and transformation to tone letters
6.18 Phonological representations based on acoustic values
7.1 The value of citation tones and sandhi tones in SHC
7.2 Stimulus sentences
7.3 An example of discourse contexts
7.4 The description of the tonal realization of S1+S2 compound and S3+S4 phrase
7.5 The effects of contrastive focus on the maxf0 and minf0 of each syllable
7.6 The effects of contrastive focus on the rhyme duration and mean intensity of each syllable
12.1 Means (n = 5) and standard deviations (SD) of the ages of male and female children of nine age groups from 4 to 12 years
12.2 The mean F0 values (in Hz) of the Cantonese tones [55 33 22] for male and female children at 4 to 12 years of age and adults in early 20s
12.3 Ratios of the mean F0 values (in Hz) of the Cantonese tones [55 33 22] for children at 4 to 12 years of age to those for adults in early 20s of the same gender
12.4 F0 ratios of females to males of each of the age groups, 4–12 and early 20s, for the Cantonese tones [55 33 22]
12.5 F0 ratios of the Cantonese tones [55] to [33] and [33] to [22] for male and female children at 4 to 12 years of age and adults in early 20s
12.6 F0 ratios of children to adults of the same gender for Cantonese and English
12.7 F0 ratios of females to males of the same age group for Cantonese and English
13.1 Potential influence of anticipatory coarticulation on T2 and T4 accuracy rates
13.2 Error patterns with positional information
13.3 Substitutions with positional information
13.4 Average F0 values of T2 offsets in correct productions
13.5 Average F0 values of T4 onsets in correct productions
13.6 The top three disyllabic response tones for target T2 (LH) at initial positions
13.7 Statistical analyses of error type comparisons for T2-T1, T2-T4, T4-T1, and T4-T4
Si Chen received her PhD in linguistics and her MS in statistics from the
University of Florida in 2014. She works on statistical modeling of speech
production, perception and their relationship, as well as applications in
speech training and speech therapy. She has developed statistical models
to solve challenging problems in phonology and simulate the human
perception process in extracting linguistic information from varied
speech signals of tones. Her publications include Chen, Si, Caicai Zhang,
Adam McCollum and Ratree Wayland (2017) “Statistical Modeling of
Phonetic and Phonologised Perturbation Effects in Tonal and Non-Tonal
Languages,” Speech Communication, 88, pp. 17–38.
San Duanmu is Professor of Linguistics, University of Michigan. He received
his PhD in Linguistics from the Massachusetts Institute of Technology in
1990 and has held teaching posts at Fudan University, Shanghai (1981–
1986) and the University of Michigan, Ann Arbor (1991–present). His
research focuses on general properties of language, especially those in
Jun Gao is Associate Professor of the Institute of Linguistics, Chinese
Academy of Social Sciences. Her research interests are phonological development,
infant speech perception, and children’s speech production. She
has published Shi, R., Gao, J., Achim, A., and Li, A. (2017) “Perception
and representation of lexical tones in native Mandarin-learning infants
and toddlers,” Frontiers in Psychology, 8, p. 1117.
Carlos Gussenhoven is Professor Emeritus of General and Experimental
Phonology at Radboud University (Nijmegen, the Netherlands). He has
analyzed the phonologies of a number of languages, with a special orientation
on prosody. He has published The phonology of tone and intonation
(2004, Cambridge University Press) and coauthored Understanding phonology
(1998, 4th edition 2017, Routledge).
Judith Hanssen obtained her PhD from Radboud University (Nijmegen,
the Netherlands) in 2017. She specializes in phonetic and phonological
variation in (dialect) intonation, which resulted in a dissertation entitled
Hongming Zhang is Professor and Head of the Chinese Language &
Linguistics Program at the University of Wisconsin-Madison. He is also
executive editor of International Journal of Chinese Linguistics, series
editor of Routledge Studies in Chinese Linguistics, and editor of the
volume Phonology and Poetic Prosody of The Encyclopedia of China (3rd
edition). His recently published books include Syntax-phonology interface:
Argumentation from tone sandhi in Chinese dialects and Tonal prosody
in Yongming style poems.
Hongming Zhang
also challenged by a large number of counterexamples in various languages,
and thus are subject to revision and updating. We begin with three chapters
in Part I to discuss this topic. Irene Vogel discusses the challenge of how to
constrain prosodic structure in the absence of the Strict Layer Hypothesis
(SLH), focusing on the Composite Prosodic Model, which includes a distinct
constituent between the phonological word and the phonological phrase –
the composite group. She thus offers a more nuanced model of the prosodic
hierarchy that recognizes three different sub-parts according to the nature of
their interface with other grammatical components. San Duanmu examines
syllabification and stress in English, and shows that it is possible to compare
current analyses in a consistent way and determine which ones fare better.
Specifically, he shows that the Law of Initials, the Law of Finals, and Max
Onset can all be satisfied at the same time (yielding Revised Max Onset). He
also claims that syllabification and stress can be evaluated simultaneously,
rather than being sequentially ordered. The resulting analysis yields consistently
good foot structures, fewer violations of the Weight-Stress Principle,
and higher percentages of correct predictions of main stress. Shuxiang You
analyzes clitics and the clitic group in Fuzhou Chinese. A thorough study of
the relevant data in Fuzhou, from the perspectives of both morphosyntactic
functions and phonological behavior, reveals that clitics in this dialect share
some common morphosyntactic and phonological properties with clitics in
other languages. Although enclitics and proclitics in Fuzhou Chinese show
asymmetries in terms of their phonological behavior, the clitic group as a
whole has very peculiar phonological behavior as compared to lexical items
and phrases, which provides motivation and evidence for the establishment of
the clitic group domain in this dialect. Moreover, Fuzhou clitics may attach
to constituents higher than the prosodic word, which constitutes a great
challenge to the Strict Layer Hypothesis.
Part II focuses on prosodic patterns. With the rapid development of
information technology and computer science, many studies have adopted an
experimental and computational approach in prosody research. Scholars have
shown increasing interest in examining the acoustic parameters associated with
prosodic phenomena. The results of these studies are widely applied in multimedia
communication, including text-to-speech, speech recognition, speech
synthesis, and so on. Four chapters contribute to Part II. Judith
Hanssen, Carlos Gussenhoven, and Jörg Peters look at additional data from
a project to see whether we can replicate the finding of a geographical cline
in the realization of non-final nuclear falling contours, and whether it is also
found for IP-final nuclear contours. They discuss the effect of dialect on the
phonetic realization of contours, as opposed to effects of time pressure, focus,
or word boundary location. They report dialectal differences in segmental
duration as well as tonal timing, pitch excursion, pitch slope, and overall
pitch level. It is well known that, compared to non-final falls, final falls may
be realized with longer segmental durations, earlier nuclear peaks, or steeper
or shorter falling excursions. Lian-Hee Wee proposes a Prosodic Essence
Conjecture (PEC), which implies a new perspective on language typology in
place of traditional notions of tone versus stress languages. A corollary of PEC
is that tone and accent are phonetically the same; thus, prosodic principles of
meter (such as minimum word requirements) would be universal. PEC rules
out prosodic contrasts where length or intensity is used without allowing
pitch. PEC does not supersede typology derived from prosodic marking at
different levels (syllable, word, phrase). Si Chen argues for statistical modeling
of phonetic data in providing a phonological representation of tones using
Chao’s letters or the L, M, and H representations. The chapter first focuses on
phonetic examinations of several cues that are then subject to a perceptual
experiment. The perceptual study shows that the other cues found in the phonetic
examination do not contribute significantly to the discrimination of allotone
pairs after voiced versus voiceless onsets, and that F0 contours are sufficient
for discriminating those allotone pairs without onset consonants. The F0
contours are statistically modeled, and the underlying pitch targets, statistically
tested to be quadratic, correspond well to the fieldwork records. The
fitted values from the optimal model were calculated, and sample
quantiles were obtained. The final representations provide similar basic tonal
shapes with some differences in the exact integers used for the onset, turning
point, and offset. This method provides a representation more consistent with
normalized phonetic F0 values, taking the perceptual aspect into consideration.
Bijun Ling and Jie Liang focus on the acoustic realization of focus
and lexical tones in Shanghai Chinese, a word-tone language. This was done
through an investigation of F0 and durational adjustment of disyllabic words
in short sentences.
Three chapters in Part III concern the interface between morphosyntax
and prosody. It has been widely observed that phonological structure is sensitive
to morphosyntactic structure, but which elements of phonological structure
are influenced by morphosyntactic structure, and how, is
still open to debate (Kaisse 1985). Ellen Kaisse reports on an initial survey of
processes in the phonological literature described as applying across words,
and she speculates on why postlexical application is so strongly skewed toward
certain kinds of processes and not others. Junko Ito and Armin Mester show
that the recursion-based conception within Match Theory allows for a conceptually
and empirically cleaner understanding of the phonological facts and
generalizations in Japanese, as well as for an understanding of the respective
roles of syntax and phonology in determining prosodic constituent structure
and of the limited types of distinctions in prosodic
category that are made in phonological representation. Hongming Zhang
discusses some interface issues through case studies of Xiamen Chinese and
Pingyao Chinese. He argues that Optimality Theory (OT) fails to
capture the nature of tone sandhi in both Xiamen and Pingyao, whether by
brute force or by ad hoc constraints, and that the interface theory under the OT
framework has no more explanatory power than the theory
proposed before the OT era.
The chapters in Part IV, by three contributors, study prosody in language
acquisition. More and more studies on prosodic properties have been
conducted in the field of first and second language acquisition. There is an
emerging interest in the following questions: When and how do infants acquire
prosodic information? What is the difference between the prosody of native
and non-native speech? How are prosodic characteristics of second language
speech related to the degree of foreign accent? Jun Gao and Rushen Shi present
their empirical findings on infants’ perception of lexical tones during the
first year of life. The findings shed light on the mechanisms of first language
acquisition, in which input-independent capacities and input-guided learning
both play a role. Wai-Sum Lee analyzes F0 (pitch) development in Cantonese
pre-adolescent children, male and female, aged 4–12 years. Her main findings
include (i) a progressive F0 decrease as age increases, (ii) a large F0 drop at age
12 in male children, indicating the onset of adolescent voice change, (iii) no
significant F0 difference between female children at age 12 and female adults,
indicating the end of female adolescent voice change, and (iv) no apparent
gender distinction in voice until age 12. Hang Zhang investigates the errors
made by 60 English, Japanese, and Korean speakers learning Chinese when
producing the two contour lexical tones T2 (rising tone) and T4 (falling tone).
This study finds that T2 is produced at a greater rate of accuracy in word-
initial positions, while T4 is produced at a greater rate of accuracy in word-
final positions. This study also finds two intertonal effects shared across the
three groups of speakers: (a) the accuracy rate of T4 is always greater when it
is followed by low tones than when it is followed by other tones, and (b) the
accuracy rate of T2 is always greater when it is followed by tones with low
onsets than when it is followed by tones with high onsets. Findings suggest
that second language tones are constrained by the cross-linguistically common
phonetic mechanism of anticipatory dissimilation.
To conclude, this volume, as a reflection of current prosodic studies, is
not only worth reading for scholars who are interested in prosody but also
for theoretical linguists, psycholinguists, and scholars investigating language
In planning this project, I had two criteria in mind: broad coverage and
balanced perspectives. It is gratifying to note that the finished chapters have
come together as planned. A good range of topics (prosodic hierarchy,
prosodic patterns, interface between prosody and syntax/morphology, and
prosody in language acquisition) is covered. Our chapters also reflect
a balanced participation by Western and Eastern scholars, as well as by
phoneticians and phonologists. The approaches employed, too, display a
balance between empirical analysis and theoretical inquiry. It is our hope that
this volume will draw more scholarly attention to prosodic studies in
both Chinese linguistics and Western linguistics.
Finally, I would like to express my deep gratitude to Tianjin Normal
University, Nankai University, Tianjin Foreign Studies University, the
Editorial Office of Contemporary Linguistics of the Chinese Academy of
Social Sciences (CASS), and the Key Lab of Phonetics and Speech Science of
CASS for funding the international conference “Prosodic Studies: Challenges
and Prospects” in June 2015. I also wish to thank my co-editor, Youyong
Qian, who generously gave his time to help with the editing of this volume,
competently handled all technical and clerical matters, and acted as liaison
with the press and individual contributors.
References
Kaisse, E. M. (1985) Connected speech: The interaction of syntax and phonology.
New York; San Diego: Academic Press.
Nespor, M., and Vogel, I. (1986) Prosodic phonology. Dordrecht: Foris.
Nespor, M., and Vogel, I. (2007) Prosodic phonology: With a new foreword.
Berlin: Mouton de Gruyter.
Selkirk, E. (1984) Phonology and syntax: The relation between sound and structure.
Cambridge, MA: MIT Press.
Selkirk, E. (1986) “On derived domain in sentence phonology”, Phonology Yearbook,
3, pp. 371–405.
Part I
Prosodic hierarchy
1 Life after the Strict Layer Hypothesis
Prosodic structure geometry
Irene Vogel
1.1 Introduction
Although Pāṇini studied phonological phenomena that apply across different
types of junctures (i.e., word-internal and word-external sandhi phenomena)
over 2,000 years ago, it is only in the last few decades that we have seen a
substantial rekindling of interest in juncture phenomena in modern linguistics.
For example, different types of junctures were directly encoded by different
boundary types in Sound Pattern of English (SPE)-type phonological analyses
and implicitly encoded in the levels of lexical phonology. Most recently,
prosodic phonology has provided a means of addressing the different domains of
application of phonological phenomena in terms of phonological or prosodic
constituents that are mapped from morphosyntactic structures, but which
might or might not be isomorphic to those structures.
While the details of the number and nature of the constituents vary to
some extent across analyses, a core principle in early models of prosodic
hierarchies (e.g., Nespor and Vogel 1986/2007; henceforth N&V) was the so-called
Strict Layer Hypothesis (SLH). The SLH served to significantly restrict the
geometry of prosodic hierarchies by requiring that a constituent of a
particular level (Cn) dominate only constituents of the immediately lower level
(Cn-1); however, it was soon realized that the SLH was too restrictive and thus
had some undesirable consequences. This chapter examines the implications
of the SLH and considers proposals to weaken it in order to overcome the
drawbacks. It will be demonstrated that the two main components of such
proposals, allowing levels to be skipped in the prosodic hierarchy and the
introduction of recursive constituents, while resolving some problems, also
introduce new complications. In fact, this is not surprising since weakening
strong limitations on any system, and the prosodic hierarchy is no exception,
will automatically increase the options within that system. The challenge then
becomes how to limit the newly available structures to avoid excessive and
otherwise undesirable options.
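As a rough illustration of the restriction described above, the SLH can be stated as a check on a tree: every node of level Cn may have only children of level Cn-1. The sketch below is not part of N&V's formalism. The level names, the tuple encoding of trees, and the function name obeys_slh are all assumptions made for this example.

```python
# Prosodic levels from lowest to highest (names are illustrative assumptions).
LEVELS = ["syllable", "foot", "word", "group", "phrase", "intonational", "utterance"]
RANK = {name: i for i, name in enumerate(LEVELS)}


def obeys_slh(node):
    """Return True if every constituent dominates only the next lower level.

    A node is a pair (level, children); a terminal syllable has no children.
    """
    level, children = node
    for child in children:
        # Strict layering: the child must sit exactly one level below its parent.
        if RANK[child[0]] != RANK[level] - 1:
            return False
        if not obeys_slh(child):
            return False
    return True


# A strictly layered tree: group > word > foot > syllable.
strict_tree = ("group", [("word", [("foot", [("syllable", [])])])])

# A tree in which a syllable attaches directly to the group (levels skipped).
skipping_tree = ("group", [("syllable", []), ("word", [("foot", [("syllable", [])])])])

print(obeys_slh(strict_tree))    # True
print(obeys_slh(skipping_tree))  # False
```

Under this encoding, any weakening of the SLH amounts to relaxing the single comparison inside the loop, which is why the set of admissible trees grows as soon as that comparison is loosened.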
Three recent proposals for re-constraining prosodic structure in the absence
of the SLH are assessed with regard to their adequacy in constraining prosodic
structure geometry as well as their success in accounting for a range of
phonological phenomena. First, Match Theory (e.g., Selkirk 2011) and the
Adjunction Approach (e.g., Itô and Mester 2009a, b), both of which exclude
a constituent between the phonological word and phonological phrase, but
admit recursive constituents, are examined and shown to have a number of
drawbacks with respect to constraining the prosodic hierarchy as well as
accounting for certain types of phonological phenomena. An alternative
proposal, the Composite Prosody Model, is advanced and shown to overcome
fundamental problems in the other approaches.
Crucially, the Composite Prosody Model includes an explicitly defined
prosodic constituent between the phonological word and the phonological
phrase, the composite group (roughly similar to the previous clitic group
(CG)). It is demonstrated that it is specifically the inclusion of this constituent
that allows us to avoid a number of the drawbacks of the other prosodic
models, permitting the formulation of a small set of strong restrictions on the
general architecture of the prosodic hierarchy as well as providing straightforward
analyses of a range of phonological phenomena in different languages.
The Composite Prosody Model, moreover, recognizes a three-way distinction
among sets of prosodic constituents within the prosodic hierarchy based
on the nature of their interface with other components of grammar (syntax,
morphology, or no interface), but it also provides the means of unifying the
different sets of constituents through the formulation of a small number of
principles that govern the overall geometry of the prosodic hierarchy.
Specifically, in Section 1.2, the role of the SLH in prosodic phonology is
reviewed, and its problems, as well as its contributions, are considered. Then
Section 1.3 discusses the main proposals for weakening the SLH, focusing on
skipping levels in the prosodic hierarchy and recursion. Since weakening the SLH
introduced a number of new drawbacks, recent approaches to addressing these
problems are considered in Sections 1.4 and 1.5. In the former, models without
a constituent between the phonological word and phonological phrase (Match
Theory, Adjunction Approach) are examined, and in the latter, the Composite
Prosody Model, with the intervening composite group, is examined. Section
1.6 synthesizes the different types of considerations addressed in the preceding
sections and addresses the question of whether the differences among the various
constituents mean that it is not feasible to maintain a single prosodic hierarchy.
It is argued that the Composite Prosody Model does, in fact, provide the means
of unifying the prosodic hierarchy, while also recognizing important differences
among the constituent levels. Finally, Section 1.7 offers general conclusions.
1.2 The prosodic hierarchy and the role of the Strict Layer Hypothesis
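[Figure 1.1: the prosodic hierarchy, from highest to lowest constituent]
Phonological Utterance
Intonational Phrase (ι)
Phonological Phrase (φ)
Clitic / Composite Group (CG)
Phonological Word (ω)
Foot (Σ)
Syllable (σ)
Phonological Utterance through Phonological Word: interface with other components of grammar
Foot and Syllable: no interface with other components of grammar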
phonological, or prosodic, constituents that are related to, but not necessarily
identical to, syntactic structures. The approach was extended to mismatches
with morphological structure, and a combined hierarchy of the different types
of prosodic constituents was developed. The hierarchy was then sometimes
further extended to include smaller phonological structures consisting of more
than a single segment. An early model that incorporates these various types
of components is that presented in Nespor and Vogel (1986, 2007), shown in
Figure 1.1.
Developments in prosodic phonology have included some differences
in the constituents of the hierarchy as well as the principles by which the
constituents are constructed. The constituent most commonly excluded from
the prosodic hierarchy is the CG, for reasons discussed below. The phonological
utterance is also frequently absent, generally because investigations
tend not to focus on phenomena with such large domains; however, in Match
Theory, it has to some extent been supplanted by a recursive intonational
phrase (Selkirk 2011). In other models, constituents are excluded on a
case-by-case basis; for example, it has been proposed by Schiering et al. (2010)
that in Vietnamese there are no constituents between the syllable and the
phonological phrase.
In some analyses, we also find proposals for additional or slightly
different constituents such as accentual and intermediate phrases, or roughly
corresponding major and minor phrases (among others, Beckman and
Pierrehumbert 1986; Elordieta 1997, 2007; Itô and Mester 2007, 2009a, 2012;
Jun 1998, 2005a (for overview); Selkirk et al. 2003; Selkirk and Tateishi 1988;
Shinya et al. 2004; Venditti 2005). The prosodic stem has also been proposed
as a constituent in the hierarchy, most notably for Bantu (e.g., Downing 1999;
Jones 2011) and Salish languages (e.g., Czaykowska-Higgins and Kinkade
1998). A number of so-called recursive constituents have been introduced as
well. These are most commonly found at the phonological word level (among
many others, Anderson 2005; Booij 1996; Hall 1999; Itô and Mester 2003,
2007, 2009a, b; Peperkamp 1997; Selkirk 1996, 2011; Vigário 2003), although
there are also proposals for recursive phonological phrase and intonational
phrase constituents (among others, Gussenhoven 2004, 2005; Itô and Mester
2007, 2009a, 2012; Ladd 1986, 1996/2008; Selkirk 2011; Truckenbrodt 1999).
Different approaches to constructing the constituents have also been
advanced. The original procedure, which has come to be referred to as the
relational approach, used various types of morphosyntactic information (e.g.,
XP structure, side and branchingness of complements, functional elements) in
the mapping algorithms for constructing prosodic constituent structures (e.g.,
Selkirk 1978, 1980a, 1986; N&V 1982, 1986/2007). Subsequent methods of
constituent construction have made use of morphosyntactic interfaces as well
but have relied on a series of different types of principles, indicated by their
names, for example, Alignment Theory (e.g., Selkirk 1986; Selkirk and Tateishi
1991), Wrap Theory (e.g., Truckenbrodt 1999), the adjunction approach (e.g.,
Itô and Mester 2009a, b), and most recently, Match Theory (e.g., Selkirk 2011).
By contrast, there is another category of constituent construction model,
referred to here as Phenomenon-Based, where prosodic constituents are not
created on the basis of mappings from other grammatical constructs, but
rather on the basis of specific phonological phenomena observed in a particular
language. For example, the Tone and Break Indices (ToBI) Approach
establishes a series of constituents in a language in relation to observed pitch
patterns and boundary phenomena such as lengthening and pausing (e.g.,
Beckman and Ayers 1994; Beckman and Hirschberg 1994; Jun 2005a; Venditti
2005; see Jun 2005b, 2014 for overview and language studies). Additionally, in
the Distributional Typology approach, prosodic constituents are constructed
as needed, based on the application of a language’s phonological rules and/or
other patterns (e.g., Bickel et al. 2009; Schiering et al. 2007, 2010).
In the earlier models of the prosodic hierarchy, the overall geometry
was substantially restricted by the SLH; however, the SLH was soon found
to be too restrictive. Thus, despite differences in the number of prosodic
constituents and the means by which they were constructed, most subsequent
developments of prosodic theory have shared a common challenge of how to
appropriately weaken the SLH.
Domain span rules apply throughout a string of category Cn, without regard
for any internal structure; however, it was understood that, due to the SLH, the
internal structure of Cn could only contain one or more Cn-1 constituents, and
that each of these constituents would be similarly structured. Domain limit rules
require the presence of a left or right edge of a given constituent type. It was also
understood that the edge in question would coincide with the corresponding
edge of the next lower level, and of any further lower levels. In domain juncture
rules, two constituent levels must be taken into consideration, and the SLH
ensured that the juncture was between two constituents of the same type, and
that these constituents were contained within the same larger constituent.
As a consequence of determining what types of prosodic structures and
rules were possible, the SLH also made specific claims about what we would
not expect to find in languages. For example, it was predicted that we would not
find structures (and rules applying to structures) such as those in (4). To
facilitate identification of the constituents, the phonological words are indicated
in bold and the phonological phrases are enclosed in braces (i.e., {}); where
multiple constituents of the same level are present, they are numbered
sequentially. The symbols ι, φ, ω, Σ, and σ represent, respectively, the intonational
phrase, phonological phrase, phonological word, foot, and syllable.
In this case, not only do the ωs coincide with CGs, but they also happen
to coincide with feet, and the feet with syllables. While such overlapping
structures can be found in English, they do not constitute a substantial
(6) lo si serve
it.CL one.CL serves ‘one serves it’
In order for the two clitics, lo and si, to combine into a CG (or phonological
phrase in a tree lacking the CG), according to the SLH, they must be
ωs, like the verb serve. This is problematic, however, since the clitics do not
otherwise have the properties associated with ωs (e.g., they only contain a
single mora and fail to satisfy word minimality; consequently, they also do not
exhibit stress like other ωs). Thus, while promoting the clitics to ωs allows the
CG to consistently dominate constituents only one level lower in the prosodic
tree, doing so compromises the crucial characteristics of the ω itself (among
others, N&V 2007; Vogel 1999, 2009).
At first glance, it might seem possible to combine the two clitics in (6) into a
ω, presumably by first combining them into a foot. This would yield a structure
that meets word (and foot) minimality, consistent with Itô and Mester’s (2003)
Maximal Parsing constraint that groups two syllables into feet in Japanese
word clippings, and two monosyllabic function words into feet in German
(Itô and Mester 2009a, following Kabak and Schiering 2006). Such a structure,
however, yields incorrect results in Italian. That is, if the sequence lo si
constitutes a ω (i.e., [lo si]ω), it would incorrectly be subject to the (Northern)
Italian Intervocalic s-Voicing rule, which applies within the ω domain (e.g.,
N&V), as shown in (7).
(8) a. s fi da ‘challenge’
b. syllable extrametricality
ca ser ma ‘barracks’
In (8a), /s/ is excluded from the syllable onset with /f/ in accordance with
the Sonority Sequencing Principle, and parsed directly into the ω, skipping
both the syllable and foot levels. In (8b), stress is on the penultimate syllable,
the head of its foot. The first (light) syllable cannot be included in the foot,
nor can it form a foot on its own, so it is parsed at the ω level, skipping the
foot level.
Given the precedents for skipping levels in the lower prosodic constituents,
weakening the SLH to permit the skipping of levels in the interface
constituents does not introduce a completely foreign option into phonological
structure, and it offers a solution to several problems mentioned in the
previous section. For example, if smaller constituents are no longer promoted
to larger constituents for which they lack the necessary properties, the structure
in (6) above can be revised as in (9), where the syllables corresponding to
the clitics are parsed directly at a higher constituent level Cn (i.e., composite
group in the present model).
(9) [tree diagram not reproduced; one node is labeled “= Cn-1”]
lo si serve
it.CL one.CL serves ‘one serves it’
The revised structure avoids creating subminimal ωs, as well as the incorrect
combination of the clitics into a ω, where they would be expected to undergo
ω level phonological phenomena (cf. (7) above). Crucially, the structure in
(9) also makes the correct prediction regarding the lack of phonological
interaction between adjacent clitics, and between clitics and their host. That is,
neither the /s/ of the clitic si nor that of the verb serve becomes [z], since their
intervocalic contexts do not fall within the ω. Note that Intervocalic s-Voicing
(ISV) is also correctly predicted not to apply with a clitic following its host
(e.g., [guardando]ω si, not [guardando]ω *[z]i ‘looking at oneself’).
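The domain sensitivity of Intervocalic s-Voicing can be sketched in a few lines. The code below is only an approximation under stated assumptions: it works on orthographic strings rather than phonological representations, treats a, e, i, o, u as the vowels, and uses hypothetical function names (voice_s_within_word, apply_isv). The form casa is an illustrative example added here, not one taken from the chapter.

```python
VOWELS = set("aeiou")  # simplifying assumption: orthographic vowels only


def voice_s_within_word(word):
    """Voice s to z when it stands between two vowels inside one ω."""
    out = list(word)
    for i in range(1, len(word) - 1):
        if word[i] == "s" and word[i - 1] in VOWELS and word[i + 1] in VOWELS:
            out[i] = "z"
    return "".join(out)


def apply_isv(parse):
    """Apply the rule to each unit of a parse.

    A parse is a list of (string, is_phonological_word) pairs. Units that are
    not ωs (e.g., clitic syllables attached higher up) are left untouched, and
    the rule never looks across unit boundaries.
    """
    return [voice_s_within_word(s) if is_word else s for s, is_word in parse]


# Clitics parsed outside the ω, as in the revised structure: no voicing.
print(apply_isv([("lo", False), ("si", False), ("serve", True)]))
# ['lo', 'si', 'serve']

# Counterfactual parse with both clitics grouped into one ω: voicing is
# wrongly predicted for the /s/ of si.
print(apply_isv([("losi", True), ("serve", True)]))
# ['lozi', 'serve']

# A word-internal intervocalic /s/ does undergo the rule.
print(apply_isv([("casa", True)]))
# ['caza']
```

The point of the sketch is that the rule itself never changes; only the parse does. Whether voicing is predicted follows entirely from which strings are grouped into a ω.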
Finally, although it is not the focus here, it should be noted that skipping
levels has also been proposed for higher constituents of the prosodic hierarchy.
For example, in Selkirk’s (2011) analysis of the Bantu language, Xitsonga, an
intonational phrase may directly dominate a phonological word, skipping the
φ level, as illustrated in (10).14
It is argued that the final phonological word is parsed directly into the
intonational phrase since it undergoes high tone spread from mu-nw!í; if
1.3.2 Recursion
Removing the strict dominance requirement of the SLH not only permitted
prosodic levels to be skipped, but it also opened the door for recursion. If a
constituent is not required to dominate only constituents of the next lower
level, it could just as well dominate constituents of its same level, or even a
higher level. These two options were seen above in (2b) as Recursion 1 (i.e.,
[… [ ]Cn]Cn), and (2c) as Recursion 2 (i.e., [… [ ]Cn+1]Cn). While both types of
recursion are found in syntax, only Recursion 1 is typically proposed for
prosodic structure; thus, an additional principle may be needed to exclude
Recursion 2. Note that if both types of recursion are allowed in prosodic
structure, the SLH is effectively eliminated, not just weakened.
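The four dominance options discussed here (strict dominance, level skipping, Recursion 1, and Recursion 2) can be told apart by comparing the levels of a constituent and its daughter. The following is a minimal sketch under stated assumptions: the level names and their numeric ranks are hypothetical labels chosen for illustration, not notation from the chapter.

```python
# Hypothetical numeric ranks for prosodic levels (illustrative assumption).
LEVELS = {
    "syllable": 0,
    "foot": 1,
    "word": 2,
    "composite_group": 3,
    "phrase": 4,
    "intonational_phrase": 5,
}

def classify_dominance(parent: str, child: str) -> str:
    """Label the relation between a constituent and one of its daughters."""
    diff = LEVELS[parent] - LEVELS[child]
    if diff == 1:
        return "strict"        # Cn dominates Cn-1, as the SLH requires
    if diff > 1:
        return "skipping"      # one or more intermediate levels are skipped
    if diff == 0:
        return "recursion_1"   # [... [ ]Cn]Cn
    return "recursion_2"       # [... [ ]Cn+1]Cn (daughter is of a higher level)

print(classify_dominance("word", "syllable"))  # skipping
print(classify_dominance("word", "word"))      # recursion_1
```

If all four labels are admitted as well formed, no pair of levels is excluded, which is the sense in which the SLH would be eliminated rather than weakened.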
Although there is no single type of motivation provided for the introduc-
tion of recursion across prosodic levels, the main considerations involve the
avoidance of constituent proliferation and the expression of similarities between
certain types of strings. The potential parallelism between prosodic and (recur-
sive) morphosyntactic structures is also considered a motivation in some cases,
especially at the higher prosodic levels (φ and ι), as discussed further below.
Like skipping levels, recursion has precedents in the lower, non-interface
prosodic constituents. For example, Recursion 1 has been proposed to account
for extrasyllabic consonants, so instead of the type of structure seen above in
(8), a consonant that is excluded from a syllable for violating the Sonority
Sequencing Principle (SSP) would be included in a recursive syllable (σ’), as in
(11). In this larger σ’, the SSP is no longer in effect (among others, McCarthy
1979; see discussion in Watson 2011).15
(11) [[l æ p]σ s]σ’   [s [t æ b]σ]σ’
(12) [Tree diagrams (a) and (b) for Canada (Ca na da); in (12a), a lower foot Σ is dominated by an upper foot Σ’.]
While the lower foot structure in (12a) meets the requirement that feet be
binary branching, with maximally two syllables, the upper foot, or Σ’, fails
to meet this requirement. It might be argued that the Σ’ is binary branching,
dominating a Σ and a σ; however, the content of Σ and Σ’ is nonetheless dis-
tinct. Thus, the seemingly recursive syllable and foot structures, in fact, exhibit
different, rather than the expected similar, properties at the repeated constituent levels.
Although the role of the mora in the prosodic hierarchy is not totally clear,
recursion has been proposed for this element as well. In this case, recursion
is usually introduced to parse non-moraic segments with moraic ones, for
example, combining an onset consonant with the vowel in a CV syllable (e.g.,
[t [a]μ]μ’).16 Recursive moras have also been proposed for Arabic as a means
of distinguishing between segments that do and do not count (i.e., contribute
weight) for the purpose of stress assignment. As shown in (13), only recursive
moras (i.e., with a moraic presence at both the lower and upper levels) con-
tribute to syllable weight, so the structure in (13a) constitutes a heavy syllable
but the one in (13b) does not (e.g., Hayes 1995; Watson 2002, 2011).
(13) [Diagrams (a) and (b): syllables with moras at the lower (μ) and upper (μ’) levels.]
(14) [lo si [serve]ω]ω’
     itCL oneCL serves ‘one serves it’
Again, there is a problem if C and C’ are considered the same type of con-
stituent (ω) since they exhibit different phonological behaviors. As with level
2 prefixes, the ω-domain ISV rule fails to apply in a ω’ with clitics, and both
instances of /s/ remain voiceless (i.e., [lo si [serve]ω]ω’, not *[lo zi [zerve]ω]ω’), as
noted above.
Although they do not necessarily involve stray elements, compounds are
also often analyzed as recursive phonological words. The individual members
form ωs on their own, and when they are combined into a compound word,
the result is labeled ω’, as in (15) and (16).
(15) [[police]ω [academy]ω]ω’
(16) [[[fish]ω [bowl]ω]ω’ [light]ω]ω’
Assessment of skipping levels
The single change of removing strict dominance results in an enormous
increase in possible constituent structures, as was illustrated above in (4). The
simple structures in (18), without recursion or a constituent between the ω
and the φ, offer further insight into the magnitude of the increase.
[grámmar]ω / [grammát-ical]ω / [grammat-icál-ity]ω). By contrast, level 2 affixes
are excluded from the ω, and do not participate in stress assignment (e.g.,
[féver]ω / [[féver]ω ish]ω’ / [[[féver]ω ish]ω’ ly]ω’). Similarly, clitics do not partici-
pate in word-level stress assignment (e.g., [[séver]ω it]ω’ / [[[séver]ω ing]ω’ it]ω’).
Furthermore, as noted above, the individual members of compounds have
stress assigned to their own ωs, while the whole compound undergoes the
Compound Stress Rule (e.g., [[féver]ω [blíster]ω]ω’). Considering the various ω
and ω’ structures to be the same type of constituent suggests that they should
have the same stress properties, which is clearly not the case.
Another well-known type of stress pattern that also exhibits a difference
between the ω and ω’ is the “trisyllabic window” found in Italian and other
languages, according to which stress must appear on one of the last three
syllables of a ω.22 When clitics are added in a ω’, however, the same restriction
is not observed, as illustrated in (19).
teléfona me lo
telephone (to) meCL itCL ‘telephone it to me’
The antepenultimate stress in the verb form teléfona falls within the trisyl-
labic window, and when clitics are added, the stress remains on that syllable.
Thus, in the ω’ in (19), it appears on the fifth-to-last syllable.23
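The contrast in (19) can be checked by counting syllables from the right edge of the relevant domain. This is a minimal sketch: the syllabification lists and the function names are assumptions made for illustration, and the window size of three is taken from the description above.

```python
def stress_position_from_end(syllables, stressed_index):
    """1 = final syllable, 2 = penultimate, 3 = antepenultimate, and so on."""
    return len(syllables) - stressed_index

def within_window(syllables, stressed_index, window=3):
    """True if stress falls on one of the last `window` syllables of the domain."""
    return stress_position_from_end(syllables, stressed_index) <= window

verb = ["te", "le", "fo", "na"]   # teléfona: stress on "le" (index 1)
clitics = ["me", "lo"]

print(within_window(verb, 1))                        # True  (third from the end of the ω)
print(stress_position_from_end(verb + clitics, 1))   # 5     (fifth-to-last in the ω’)
print(within_window(verb + clitics, 1))              # False
```

The restriction therefore holds when it is evaluated over the ω alone, but not when it is evaluated over the larger string that includes the clitics.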
With regard to segmental phenomena, it was seen above that the Italian
ω domain rule of Intervocalic s-Voicing does not apply with level 2 prefixes,
clitics, or compounds, all of which would be parsed as ω’. Thus the ω, which
exhibits ISV (e.g., [i[z] ola]ω ‘island’, [noi-o[z]-in-o]ω ‘somewhat boring’
(< bore-adj-dim-m,sg)),24 is distinct from the various ω’ structures, which
do not exhibit ISV (e.g., [lo [s]i ri-[[s]ala]ω]ω’ ‘one resalts it’ (< itCL oneCL
re-salts), [[dicendo]ω [s]e lo]ω’ ‘saying it to oneself ’ (< saying selfCL itCL),
[[porta]ω [[s]apone]ω]ω’ ‘soap dish’ (< carry soap)). Languages with vowel
harmony (VH) also consistently exhibit discrepancies between the phon-
ology of the ω and ω’. While VH typically applies throughout a ω, it does
not usually apply throughout a ω’ consisting of a compound word with multiple
ωs (e.g., Hungarian: [[olvasó]ω1 [terem]ω2]ω’ ‘reading room’ (ω1 = +Back;
ω2 = -Back)).
Thus far, the differences between the ω and ω’ constituents have involved
the application of rules within the smaller ω domain but not the larger ω’;
however, there are also cases in which rules apply within the ω’ but not the ω.
For example, in English, the well-known Voicing Assimilation rule applies in
the ω’ domain, as seen with the addition of a (level 2) plural or third person
singular –s, which is voiced following a voiced (non-strident) segment (e.g.,
[nz]: (the) [[fan]ω-s]ω’, (he) [[fan]ω-s]ω’), but voiceless after a voiceless segment
In prosodic phonology, the number of proposed constituents has not
reached “cancerous” proportions as feared. In fact, in any Interface-Based
prosodic model, where constituents are constructed via general mapping
procedures between morphosyntactic constituents and phonological struc-
ture, the number of constituents is automatically restricted. Thus, the rela-
tional approach to the prosodic hierarchy in N&V comprised five interface
constituents. As noted above, an additional prosodic stem constituent has
been proposed in some cases, but this too is based on a specific, morphologic-
ally identifiable element. One or two tone-related constituents (e.g., accen-
tual, major, minor phrases) have also been proposed in some analyses, but
these tend to coincide roughly with other established constituents (e.g., Itô
and Mester 2012; Selkirk and Koichi 1988; Shinya et al. 2004). In Match
Theory (Selkirk 2011), we find six domains, presented as three pairs of recur-
sive constituents (i.e., the inner and outer variants of the basic phonological
word, phonological phrase, and intonational phrase domains), all of which
are established in relation to specific morphosyntactic structures.
As mentioned earlier, of the five prosodic constituents in N&V, the CG
has often been viewed with suspicion and removed from the prosodic hier-
archy. Since the CG, and its subsequent development as the composite group
(e.g., Vogel 2009), was constructed in relation to specific morphosyntactic
elements (e.g., Hayes 1989; N&V), it did not, in fact, pose a risk of initi-
ating a slippery slope toward the unchecked proliferation of constituents.
Moreover, removing the CG does not remove the fact that there are clitics,
and other stray elements, that must be accommodated in some way in the
prosodic hierarchy. In fact, it is precisely such elements that are typically
parsed in the ω’.
Even in the Phenomenon-Based prosodic models that construct
constituents specifically to accommodate the phonological phenomena of
a given language, relatively few additional types of constituents have been
introduced. For example, in the ToBI Approach (see Section 1.2.1.), although
proliferation is not excluded on principle, only a small number of different
constituents have been proposed. Where we do, however, see a proliferation of
constituents is in the Distributional Typology approach developed by Bickel
and colleagues (see Section 1.2.1). Here too, though, it is not the constituent
categories per se that have proliferated, but the number of recursive levels of
a single constituent, in particular, the phonological word, as in (20).
Although at first glance such a structure may not appear to result in pros-
odic constituent proliferation, closer examination reveals exactly the same
problem that arose with multiple boundary types and lexical levels. That is,
each of the phonological word levels is constructed to account for a different
(22) Max and min levels in recursive prosodic structures (Selkirk 2011)
a. Cmax = C level constituent not dominated by another C
b. Cmin = C level constituent not dominating another C
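The definitions in (22) can be applied mechanically to a tree. The sketch below is illustrative only: the `Node` class, the category labels, and the function names are assumptions introduced here, and "dominated" is interpreted as domination by any ancestor, not only the immediate mother.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str                 # hypothetical identifier for printing
    cat: str                  # constituent category, e.g. "phi" or "omega"
    children: list = field(default_factory=list)

def has_descendant(node, cat):
    """True if some constituent of category `cat` is dominated by `node`."""
    return any(c.cat == cat or has_descendant(c, cat) for c in node.children)

def label_max_min(node, ancestors=()):
    """Return (name, is_max, is_min) for every node, following (22a) and (22b)."""
    is_max = node.cat not in ancestors              # not dominated by another C
    is_min = not has_descendant(node, node.cat)     # not dominating another C
    rows = [(node.name, is_max, is_min)]
    for child in node.children:
        rows.extend(label_max_min(child, ancestors + (node.cat,)))
    return rows

tree = Node("outer", "phi", [
    Node("inner", "phi", [Node("w1", "omega")]),
    Node("w2", "omega"),
])
for row in label_max_min(tree):
    print(row)
# ('outer', True, False)
# ('inner', False, True)
# ('w1', True, True)
# ('w2', True, True)
```

A non-recursive constituent, such as each ω here, is simultaneously maximal and minimal. With three or more recursive layers, any middle layer would be neither.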
In fact, the resulting six structures roughly line up with the domains iden-
tified in N&V, as shown in (23), where the composite group replaces the clitic
group. Indeed, Match Theory offers an option not present in Nespor and
As can be seen, the adjusted phonological structure in (24) is flatter than the
originally mapped structure, but now the resulting phonological constituents
no longer exhibit the recursion of the original syntactic structure. Moreover,
it is not clear what the additional constituents represent. If, as is argued, only
the uppermost and lowest levels of a constituent type need to be identified
(i.e., Cmax and Cmin), any intermediate levels would then be without a pros-
odic status, for example, the three (x)ι constituents in (24). If, instead, such
constituents are relabeled as (x)ιmin, the definition of this type of constituent is
not consistent across the different instances, and again, it is not clear how the
phonological structure is recursive in parallel to the corresponding recursive
syntactic structure.
plural markers are also parsed at this level (i.e., [[blue]ω [berry]ω s]ω’). By con-
trast, in the phrase blue berries, the plural pertains only to the noun berry, so
the structure just requires two ωs, which coincide with the highest ω’ for each
word (i.e., [[blue]ω/ ω’ [berries]ω/ ω’]φ).
At the higher levels of the prosodic hierarchy, the structures are not similarly
restricted, since clitics and function words may be parsed in any constituent
(i.e., φ/φ’ or ι/ι’), typically in parallel with their syntactic structure.
“Directional clitics” (DCLs) are the exception, since they must always attach
to a host either on the right or the left regardless of the syntax, as in the
leftward attachment of the auxiliary and copula –s in English (e.g., Klavans
1982, 1985; Zwicky 1984; N&V). It was noted above (cf. (18)) that the lack
of a prosodic restriction on the parsing of stray elements at the higher levels
predicts the possibility of numerous structures, as illustrated in (25), with
one and two stray syllables to the right of the head constituent; however, any
number and sequence of syllables and/or other elements could be included,
also to the left of the head.
(25) [Bracketed structures with one and two stray function words (FW) to the right of the head constituent.]
are skipped, it is possible to arrive at structures in which C’ dominates another
C, and possibly other elements such as Cn-2, but not Cn-1.
(28) Italian Intervocalic s-Voicing: does not apply across ωs within κ (/s/ → [s])
a. [[lo]σ [si]σ [ri]σ [sala]ω]κ [lo si risala] ‘one re-salts it’
b. [[comprando]ω [se]σ [lo]σ]κ [komprando se lo] ‘buying it for oneself’
c. [[porta]ω [sapone]ω]κ [porta sapone] ‘soap dish’
It should be noted that even if ri- is not recognized as an affix due to the
lexicalized meaning of (29), ISV is still correctly predicted since ri-would then
be considered part of the root, and thus automatically part of the ω.
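The domain restriction in (28) can be made concrete by applying the rule separately to each element of the κ, and only to elements that are ωs. The sketch below is a deliberately crude approximation: it works on orthographic strings instead of phonological representations, and the category labels and function names are assumptions made for illustration.

```python
VOWELS = set("aeiou")

def isv(form):
    """Voice s to z between vowels inside a single ω (orthographic approximation)."""
    out = list(form)
    for i in range(1, len(form) - 1):
        if form[i] == "s" and form[i - 1] in VOWELS and form[i + 1] in VOWELS:
            out[i] = "z"
    return "".join(out)

def pronounce_kappa(parts):
    """parts: list of (form, category); ISV applies only inside 'omega' parts."""
    return " ".join(isv(form) if cat == "omega" else form for form, cat in parts)

# (28a): the clitics and the prefix are parsed directly in the κ, outside the ω
kappa = [("lo", "sigma"), ("si", "sigma"), ("ri", "sigma"), ("sala", "omega")]
print(pronounce_kappa(kappa))   # lo si ri sala
print(isv("isola"))             # izola  (ω-internal /s/ is voiced)
print(isv("losirisala"))        # lozirizala  (wrong result if the whole string were one ω)
```

The last line shows the overapplication that would follow if the clitics and prefix were included in the ω domain.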
In addition, it can be seen that an Italian phonotactic constraint on the pal-
atal lateral [ʎ] is straightforwardly accounted for by the distinction between
the ω and κ constituents. That is, while [ʎ] is excluded from the onset of a
syllable at the beginning of a ω, it may appear in the onset of a syllable in
other positions within the κ. Thus, “gl” is pronounced as [gl] rather than [ʎ]
ω-initially in (31), but as [ʎ] in other positions, as in (32).
Note that [ʎ] is also allowed as a syllable onset word internally, where it
may arise as part of a geminate (e.g., figli [[fiʎ.ʎi]ω]κ ‘sons’).
The English voicing assimilation patterns also crucially differ in the ω
and κ constituents, with the former being more permissive than the latter.
For example, as noted previously, we find both assimilated and unassimilated
sequences involving /s/ within the ω (e.g., [z]: [cleanse]ω, [Mars]ω;
[s]: [fence]ω, [parse]ω); however, beyond that level, only assimilated sequences
may appear, regardless of the nature of –s. By parsing all of the “stray” –s
morphemes in the same way within the κ, as shown in (33), the similarity in
their phonological behavior is accounted for, as is their difference from the
ω level behavior.
(37) Minimal Distance: Parse phonological material into the first available
prosodic constituent.
This principle places a strong restriction on possible ωs, and thus excludes
certain types of structures that have been deemed ωs in previous analyses. In particular,
it excludes ωs consisting of only an affix or function word, even if it is
phonologically “substantial” (among others, Booij 1985, 1999, 2007; Itô and
Mester 2009a; Vigário 2003; Wiese 1996). It also excludes ωs consisting only of
a combination of affixes and/or function words (e.g., Dixon and Aikhenvald
2002). As seen above, in the Composite Prosody Model, such “stray” elements
are parsed directly in the Composite Group, both avoiding the need to define
the ω differently in different situations and keeping the prosodic structure to
a minimum.
In analyses where more “substantial” functional elements or affixes are
analyzed as phonological words, this is typically done on the basis of prop-
erties that, in fact, coincide with foot properties (e.g., weight or prominence).
While this coincidence is not surprising, since the minimal (phonological)
word is usually coextensive with a foot, relabeling certain feet or combinations
of syllables as ωs essentially undermines the notion of universal prosodic
constituents. Some ωs are defined via mapping rules from morphology, while
others are defined in a language-specific way dependent on what is deemed
a ω in a given language. In fact, this outcome is similar to the problematic
renaming of various elements as ωs in N&V to satisfy the SLH, although the
items in question did not conform to the more general properties of the ω. In
the present proposal, the items in question only need to constitute feet, and
these in turn are parsed directly in the composite group. They thus exhibit
the necessary weight or other phonological properties, without unneces-
sarily being ascribed morphological attributes associated with the interface
mapping (e.g., Vogel 2009, 2010, 2012).
The morphological core is the basis for the minimal ω; however, other
material is often included as well, specifically any so-called cohering or level 1
affixes that interact phonologically with their roots. While the classification of
individual affixes as level 1 (or some equivalent) is based on language-specific,
or even item-specific, considerations, these details are not what is relevant for
the mapping principles. As proposed in Kabak and Vogel’s (2001) analysis
of Turkish, what is crucial for phonological word mapping is just the (non-)
cohering status of affixes, regardless of how this has been determined for a
given language. Specifically, the non-cohering affixes are identified as Prosodic
Word Adjoiners (PWAs), signaling that they attach to a ω, not within a ω. The
relevant information can be encoded in the form of a subcategorization frame,
along with other indications such as what part of speech an affix attaches to,
and whether it attaches to the left or the right of its base.
In Turkish, regular stress assignment applies to the final syllable of a ω.
While this may include many suffixes, given the agglutinating nature of the
language, not all affixes participate in regular stress assignment, and this is
encoded by their PWA status, as illustrated in (40); the stressed syllable is
If multiple affixes are PWAs, all that is necessary is that the first PWA
establish the end of the ω constituent; any subsequent PWAs make reference
to this boundary, as illustrated in (42).
Both -less and -ness are PWAs, and once -less establishes the right edge of
the ω, -ness recognizes this edge; it does not require another ω edge to its left.
Thus, no additional structure is introduced, and word stress applies within the
ω to the first syllable of father.
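The role of the first PWA in fixing the right edge of the ω can be sketched as a single left-to-right pass over the suffixes. This is an illustration only: the function names are hypothetical, and the treatment of any non-PWA suffix that might follow a PWA (here it is simply left outside the ω) is an assumption, since that case is not discussed above.

```python
def parse_kappa(root, suffixes, pwas):
    """Split root + suffixes into ω material and stray κ material.

    The first suffix marked as a Prosodic Word Adjoiner (PWA) closes the ω;
    later suffixes refer to that edge and add no further structure.
    """
    omega, stray = [root], []
    closed = False
    for suffix in suffixes:
        if suffix in pwas:
            closed = True
        (stray if closed else omega).append(suffix)
    return omega, stray

def bracket(omega, stray):
    """Render the parse in the bracket notation used in the text."""
    inner = "[" + "".join(omega) + "]ω"
    return "[" + " ".join([inner] + stray) + "]κ"

PWAS = {"less", "ness"}
print(bracket(*parse_kappa("father", ["less", "ness"], PWAS)))   # [[father]ω less ness]κ
print(bracket(*parse_kappa("grammat", ["ical"], PWAS)))          # [[grammatical]ω]κ
```

Only one ω is built in each case, so word stress has a single domain to apply in.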
Thus far, we have seen how level 2 affixes, which are excluded from the ω,
are parsed in the κ constituent; however, it was seen above that other types of
stray elements that do not constitute ωs (i.e., clitics and other types of function
words) are similarly parsed. Typically, these elements interact phonologically
with the item to their left or right depending on which is more closely related
morphosyntactically, as illustrated in (43).
In (43a) and (43b), the clitics mi and lo are parsed to the left or the right of
the verb according to their syntactic structures, and in both cases, /i/ changes
to [e] since mi is followed by another clitic in the same κ. The change does not
occur, however, in (43c), where the clitics are parsed in separate κs, following
their syntactic structure (e.g., Vogel 2009). Directional Clitics are parsed to
the left or right regardless of their syntactic position, and interact with the
material in the κ they form part of.
Since the Principle of Minimal Distance parses all stray elements,
including DCLs, at the κ level, the correct generalization is made with
regard to the similarity of their behavior. That is, elements that are parsed
in the same way exhibit the same phonological patterns, regardless of
their position in syntactic structure. Thus, as illustrated in (44), the –s of
the English auxiliary and copula has the same phonological status as the
plural and third person singular suffixes and the possessive, despite the fact
that the auxiliary and copula are syntactically more closely related to the
material to the right.
In (44b, c), the -s is pronounced as [z], assimilating to the voiced segment to
the left, just like the items in (44a); it does not assimilate to the voicelessness
of the /f/ of the more closely related word to the right.
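The directional character of this assimilation can be stated as a function that inspects only the segment to the left of -s. The sketch is simplified in several ways that should be read as assumptions: the set of voiceless segments is partial, strident-final stems are not handled, and the function name is hypothetical.

```python
# Partial, illustrative set of voiceless (non-strident) segments.
VOICELESS = {"p", "t", "k", "f", "θ"}

def realize_s(left_segment, right_segment=None):
    """Surface form of a stray -s in the κ: it agrees in voicing with the
    segment to its left; the segment to its right is deliberately ignored."""
    return "s" if left_segment in VOICELESS else "z"

print(realize_s("n"))        # z  (fan-s)
print(realize_s("p"))        # s  (cramp-s)
print(realize_s("r", "f"))   # z  (voiced segment to the left, voiceless /f/ to the right)
```

The same function covers the plural, third person singular, possessive, auxiliary, and copula -s, which is the point of parsing them all in the same way.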
Differently from the CG in N&V, the composite group also includes the
members of compounds, and thus the κ mapping procedure must parse
together the multiple ωs of compounds, but not those of phrases.28 This is
accomplished by the Principle of the Morphological Maximum, which imposes
a maximal limit of one lexical word (LW) per κ, as stated in (45), capturing the
generalization that the combined members of compounds constitute a single
lexical item.
words), and as can be seen, the ω and κ structures account for the pertinent
phonological phenomena (i.e., Intervocalic s-Voicing, clitic /i/ change to [e],
trisyllabic (stress) window) as effectively as they did in the simpler cases in the
previous sections.
Both the suffix -ize and the function word whether are phonologically sub-
stantial, and while they resemble ωs (e.g., lexical items eyes and weather), they
are analyzed as feet here. This allows them to exhibit the relevant phono-
logical properties, including prominence on the first syllable of whether, while
reserving ω status for lexical items, which contain a morphological core.
The reduced forms of the pronouns provide additional insight into the
ω and κ, and demonstrate that they not only succeed in accounting for the
In such flat κ structures, the question is how to ensure the correct appli-
cation of any relevant phonological phenomena, for example, the voicing
assimilation of the /s/ in (49a) to [z] after the /r/ of writer, and not to [s] before
the /k/ of cramp. In fact, no additional information is necessary. The sub-
categorization frame that indicates the PWA status of -s also encodes its
direction of attachment as a suffix (i.e., to the right of a ω); similarly, the sub-
categorization frame for the plural -s attaches it as a suffix following cramp,
yielding [s] after the voiceless /p/. By the same token, the subcategorization
frames associated with the various affixes in (49b) account for their direction
of phonological interaction. In both cases, the Compound Stress Rule applies
to enhance the first member of the compound (i.e., the first ω of the com-
posite group).
Finally, the parsing of stray elements directly in the composite group
avoids what have been considered “ordering” or “bracketing paradoxes” in
other models. That is, the κ’s relatively flat structure does not encode infor-
mation corresponding to the order of morpheme attachment, and thus it
does not present the opportunity for paradoxes to arise. The only type of
morphological information that is required is whether an element is a level 1
or a level 2 affix, the latter indicated by its PWA subcategorization property.
For example, in (51) and (52), it can be seen that the order of attachment of
the affixes, indicated by the level subscripts 1 and 2, is not reflected in the
corresponding κs.
(52) Ordering paradox: Different morphological structures with the same elements
a. Morphological Structure 1: [un [[lock]V able]Adj]Adj (= cannot be locked)
b. Morphological Structure 2: [[un [lock]V]V able]Adj (= can be unlocked)
c. Prosodic Structure: [un [lock]ω able]κ
In (51b), the fact that un- is not parsed in the ω depends only on its PWA
status. Since neither -ical nor -ity is a PWA, they both form part of the ω and
participate in its stress assignment, even if un- is morphologically attached
between the two. The PWA status of the plural -s allows it to be parsed dir-
ectly in the κ, where it observes the necessary voicing assimilation pattern. The
insensitivity of the phonology to the order of morpheme attachment is seen
further in (52), where words with different internal morphological structures,
and corresponding meanings, are prosodically structured, and pronounced,
in the same way (52c).
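The convergence in (52) can be reproduced by discarding the morphological bracketing and consulting only the linear order of morphemes and their PWA status. The following is a sketch under assumptions: nested tuples stand in for morphological structure, the function names are hypothetical, and stray elements are assumed to occur only at the edges of the ω.

```python
def flatten(struct):
    """Return the morphemes of a nested morphological structure in linear order."""
    if isinstance(struct, str):
        return [struct]
    out = []
    for part in struct:
        out.extend(flatten(part))
    return out

def prosodic(struct, pwas):
    """Build a flat κ: non-PWA material forms the ω, PWAs attach outside it."""
    morphs = flatten(struct)
    core = [i for i, m in enumerate(morphs) if m not in pwas]
    first, last = core[0], core[-1]
    omega = "[" + "".join(morphs[first:last + 1]) + "]ω"
    return "[" + " ".join(morphs[:first] + [omega] + morphs[last + 1:]) + "]κ"

PWAS = {"un", "able"}
structure_1 = ("un", ("lock", "able"))   # (52a): 'cannot be locked'
structure_2 = (("un", "lock"), "able")   # (52b): 'can be unlocked'

print(prosodic(structure_1, PWAS))   # [un [lock]ω able]κ
print(prosodic(structure_2, PWAS))   # [un [lock]ω able]κ
```

Because the order of attachment is never consulted, there is no bracketing for the phonology to contradict.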
Compounds also frequently result in ordering paradoxes, for example,
when inflections apply (morphosyntactically) to an entire compound, but
interact phonologically only with the adjacent element. As seen in the English
example in (53a), although the plural –s pertains to the entire compound, it
is pronounced as [z] due to its assimilation to the directly preceding voiced
segment within the composite group. Similarly, in languages with vowel
harmony, although an inflection may pertain to an entire compound, it
participates in the harmony of the linearly adjacent material. Thus, in the
Hungarian example in (53b), while the first member of the compound has
front vowels, the rest has back vowels, including the two suffixes, which har-
monize with the directly preceding (back) root.
phrase, and syntax for the higher constituents, it has been seen in previous
sections that the two types of prosodic constituents require different types
of mapping procedures. While the former must be built up from smaller to
larger elements, the latter are established on the basis of fully formed syn-
tactic structures. Moreover, the relational mapping of the lower prosodic
constituents advanced in the Composite Prosody Model excludes recursive
structures, while these may be permitted in the latter, paralleling the recursive
structures in syntax. Indeed, it appears that the higher constituents may cor-
respondingly exhibit more repetitive phenomena, for example multiple con-
stituent edge markings (e.g., Penultimate Vowel Lengthening in Xitsonga at
the right edge of ι and ι’ (Selkirk 2011)), and tonal contours spreading across
repetitions of φ and ι structures (e.g., Ladd 1986, 1996; Itô and Mester 2012;
Selkirk 2011 among others).
The difference between the syntactic interface of the higher constituents
and the morphological interface of the lower constituents is also reflected
in the presence of exceptions. While the phonological phenomena applying
in the former appear to be fully regular, the phenomena of the latter may
be more limited and exhibit idiosyncrasies or exceptions. It was seen, for
example, that in English /n/ completely assimilates to a following /l/ or /r/ only
with the in- prefix within the ω constituent, and in Italian, the rule changing
/i/ to [e] applies only in certain sequences of clitics, in the κ constituent.32
Additionally, within the ω, there may be “disharmonic” patterns in vowel harmony
languages, and more idiosyncratic patterns such as the different ways the
final /d/ in a word such as divide surfaces when followed by different level 1
suffixes (i.e., [s]: divis-ive; [z]: divis-ible; [ʒ]: divis-ion).
Finally, a rather different type of property can also be seen to distinguish
the upper and lower portions of the prosodic hierarchy, the potential effect
of extragrammatical phenomena. At the higher levels, considerations such
as speech rate and the size or weight of constituents may override the basic
prosodic constituent mapping rules, and consequently alter the domains of
application of their phonological phenomena. Indeed, Selkirk (2011) points
out that this is fairly characteristic of the ι, and not uncommon in the φ (e.g.,
Italian Raddoppiamento Sintattico (N&V), Lekeitio Basque tonal patterns
(Elordieta 1997, 2007)). The same flexibility is not, however, characteristic
of the lower ω and κ constituents, and in fact, different applications of their
phonological phenomena would most likely signal some sort of error, not
simply an alternate phrasing option.
In sum, it is clear that there are multiple fundamental differences between
the prosodic constituents below the phonological phrase and the higher
constituents. The question is whether such differences warrant essentially
two distinct prosodic hierarchies, or whether there is some way to retain a
single prosodic hierarchy. In either case, the problem that must be addressed
if we make a distinction between the “bottom up” and “top down” mapping
procedures of the different types of constituents, is how to transition from
[Figure: the prosodic hierarchy and its interfaces.
Syntax Interface (mapping from syntax): Intonational Phrase (ι, ι’, …), Phonological Phrase (φ, φ’)
Morphology Interface (mapping: small elements to larger): Composite Group (κ), Phonological Word (ω)
No interface: Foot (Σ), Syllable (σ)]
level 2 affixes. In order for a φ to include full lexical items (with level 2 affixes),
it must be composed of the prosodic constituents that incorporate all of the
affixes, composite groups. Since the κs also parse stray functional elements, all
of the material in a given κ will be parsed in the corresponding φ. Where there
are “rough edges”, cases where bits of the lower prosodic constituents do not
align with the domains delimited by the syntax, it is the lower constituents
that prevail. That is, clitics and other functional elements come along in the
composite groups that have been built up by the relational mapping pro-
cedure, even if they are not consistent with the syntactic parsing, as in the case
of directional clitics. The transition between the lower and upper portions of
the prosodic hierarchy is illustrated schematically in Figure 1.3.
In Figure 1.3, an intonational phrase dominates two phonological phrases,
φ1 and φ2, as determined by a syntactic mapping procedure. Each φ includes
at least one lexical item, which consists of a ω and any associated material,
grouped into a κ. The shaded “x” (e.g., a directional clitic) is syntactically
part of the phrase corresponding to φ2, but since it does not interact phono-
logically with this phrase, but rather with the element to its left, it forms part
1.7 Conclusions
This chapter has considered a number of modifications of the original model
of prosodic phonology, as articulated in Nespor and Vogel (1986). Consistent
across the proposals and theoretical perspectives is the recognition that the
SLH, and specifically the principle of strict dominance, was too restrictive
and thus needed to be weakened. Since relaxing the restrictions on any type of
system, by definition, gives rise to previously excluded options, it creates new
challenges of determining whether all of the additionally permitted options
are desirable, and if not, what other types of restrictions must be instituted to
appropriately constrain the system.
As was demonstrated, weakening the SLH results in considerable
overgeneration of prosodic structure configurations. Aside from the sheer
number of possibilities, which is at least intuitively implausible, the add-
itional options result in incorrect predictions about the types of phonological
patterns that will be observed in languages, as well as the loss of generalizations
among the phenomena within a given language. Three types of approaches to
counteract the problems that arise from weakening the SLH have thus been
examined: Selkirk’s (2011) Match Theory, Itô and Mester’s (e.g., 2009a, b)
Adjunction Approach, and the Composite Prosody Model advanced here.
It was demonstrated that while Match Theory highly restricts the mapping
relations between the morphosyntax and phonology, returning to a model in
which prosodic constituents closely mirror syntactic structures, it also permits
a vast increase in the internal configurations of the prosodic constituents. In
particular, it allows stray elements to be parsed at any level of the prosodic
hierarchy, and it includes recursive constituents at all three of the levels it
recognizes: phonological word, phonological phrase, and intonational phrase.
Aside from the large number of additional prosodic configurations, the recur-
sive constituents were also shown to introduce a number of problems with
regard to the nature and definition of both recursion and the constituents
themselves. The reliance on syntax in constructing the phonological word,
moreover, was shown to obscure the well-established phonological differences
between level 1 and level 2 affixes.
The Adjunction Approach adopts the same approach as Match Theory
for the mapping of the phonological phrase and intonational phrase; how-
ever, differently from Match Theory, it substantially limits the possible
prosodic configurations by restricting the appearance of stray elements to
below the phonological phrase. It also crucially differs from Match Theory
in distinguishing between level 1 and level 2 affixes. The inclusion of recur-
sive constituents, however, introduces the same types of problems with the
definitions of the prosodic constituents and recursion that arise in Match
Theory.
Unlike both Match Theory and the Adjunction Approach, the
Composite Prosody Model advanced here crucially includes a prosodic con-
stituent between the phonological word and the phonological phrase, the
composite group. As was demonstrated, this constituent not only provides
the necessary domain to straightforwardly account for a range of phenomena
across languages that are problematic in the other models, but it also allows
us to avoid recursion, at least below the phonological phrase, and thus the
various drawbacks that accompany recursive constituents.
In parsing together a number of different stray elements (i.e., level 2
affixes, clitics, other function words), as well as compounds, the composite
group correctly predicts similarities in their phonological behavior, as dis-
tinct from those of phonological words, on the one hand, and phonological
phrases, on the other hand. It was shown, furthermore, that parsing the
stray elements in the composite group also effectively limits the possible
prosodic configurations, since the elements in question may not appear else-
where in prosodic structures. This, in turn, substantially limits the range of
phonological structures and phenomena predicted to be possible in human
language.
Examination of a number of fundamental distinctions between the prosodic
constituents that interface primarily with morphology and those that
interface with syntax at first glance appeared to suggest that there may, in
effect, be different prosodic hierarchies for the two types of constituents.
While the syntax-interface constituents seem to closely mirror the structures
from which they are mapped, possibly including recursion, the morphology-
interface constituents may diverge substantially from the corresponding
morphological (and syntactic) structures from which they are derived, via a
relational mapping. The phonological phenomena that apply in the former,
moreover, appear to be exceptionless, while the phenomena associated with
the latter often exhibit limitations and idiosyncrasies. The former also appear
to be subject to extragrammatical considerations such as their weight or size
and rate of speech, while the latter are not.
While the simplest model of phonological interfaces would certainly be
one in which the same mapping principles apply consistently at all levels,
the fact that there are different constellations of phenomena in different
portions of the prosodic hierarchy indicates that such simplicity is not ten-
able. The Composite Prosody Model offers a more nuanced view of the pros-
odic hierarchy that features a tripartite structure, with distinct properties
associated with each of the different types of constituents, those interfacing
primarily with syntax or with morphology, and those that do not interface
1 I am grateful for the discussion and comments I received on an earlier version
of this chapter from the participants at the First International Conference on
Prosodic Studies: Challenges and Prospects (Tianjin, China; June 2015). Of
course, all shortcomings are my own.
2 The body of research on the Prosodic Hierarchy is by now quite vast. It is not the
intention to provide a review of this body of literature here, but only to highlight
some core issues and representative works as they pertain to the questions under
investigation. Other recent publications offer detailed background, summaries,
and analyses of various aspects of Prosodic Phonology. For a particularly thor-
ough discussion, see Scheer (2010).
3 Some analyses have also argued for direct reference to syntactic constituents
(among others, Cinque 1993; Kaisse 1985; Odden 1987, 1996, 2000); however,
given the existence of numerous phenomena that clearly do not apply in syntactic
domains, a model that uniquely relies on syntax cannot be adequate. Selkirk’s
(2011) recent Match Theory returns to a more direct reliance on syntax, but
nevertheless leaves some room for differences between syntactic and phonological
structures, as will be discussed below.
4 The abbreviation for the Clitic Group was just “C” in N&V. The mora was not
included in N&V, but it has subsequently been included in some hierarchies since
it consists of structure beyond a single segment and participates in prosodic
phonological phenomena (e.g., stress assignment, tonal patterns).
5 The listing of variations here is only meant to be illustrative. For recent summaries
and discussions of the developments in Prosodic Phonology, the reader is referred
to Scheer (2010), Dehé et al. (2011), and Selkirk (2011), among others.
6 Note, however, that the original Phonological Utterance could include more than
one sentence (e.g., N&V; Vogel 1986), although this is not possible in a model such
as Match Theory.
7 Throughout this chapter, phonological phenomena are usually referred to as rules,
and derivational-type formulations are used to represent them. This is done as
a matter of expediency since such formulations tend to be descriptively simple
and clear; it is not intended as an argument for this type of approach over some
other type.
8 For simplicity, the Clitic Group is omitted here and elsewhere unless it is crucial
for a given discussion. It is not, however, the intention to ultimately exclude such
a constituent from the Prosodic Hierarchy, as will be seen below.
9 Such an argument is of course not relevant to approaches that do not assume a
universal set of constituents (e.g., Schiering et al. 2010).
10 The form si has several functions in Italian, so there could be more than one trans-
lation for examples using this element here and below. In each case, one possible
translation is provided.
11 The rule is stated informally here, but in fact, it also applies in the presence of
glides (i.e., [-cons] segments).
12 This independence is recognized in constraint-based analyses that include sep-
arate constraints and rankings pertaining to skipping levels and recursion (among
others, Itô and Mester 1992, and other publications; Selkirk 1996, and other
publications; Truckenbrodt 1999).
13 The parsing of /s/ directly into the ω, furthermore, allows it to remain available
for syllabification as the coda of a preceding word as needed (e.g., following a
stressed vowel as in tre sfide [trés.fí.de] ‘three challenges’). (See among others Vogel
1977, 1982.)
14 The Xitsonga examples presented in Selkirk (2011), and discussed elsewhere in this
chapter, are based on material derived from Kisseberth’s (1994) original analysis
of the language.
15 See also van der Hulst (2010) for a somewhat different view of syllable-internal
recursion, as well as a general discussion of recursion at different phonological
16 The prime diacritic is used here to show mora recursion, although in the literature
it is less commonly used for moras than for other recursive constituents.
17 This refers to so-called “non-cohering” affixes; differences between affix types are
discussed in more detail below.
18 Note that the enhancement is primarily perceptual, the effect being caused by a
reduction of the prominence of the other elements. There are also some different
stress patterns in compounds (e.g., Plag et al. 2008), but this does not alter the
main point here.
19 The treatment of marginal or more limited phenomena is interesting in its own
right; see among others Simon and Weise (2011) and Inkelas (2014).
20 There may also be cases where a higher constituent directly dominates a mora.
Anderson, S. (2005) Aspects of the theory of clitics. Oxford: Oxford University Press.
Anttila, A. (2002) “Morphologically conditioned phonological alternations”, NLLT,
20, pp. 1–42.
Basbøl, H. (1975) “Grammatical boundaries in phonology”, Aripuc, 9, pp. 109–135.
Basbøl, H. (1981) “On the function of boundaries in phonological rules” in Goyvaerts,
D. (ed.) Phonology in the 1980’s. Ghent: Story-Scientia, pp. 245–269.
Beckman, M. E., and Ayers, G. M. (1994) Guidelines for ToBI labelling. Online MS
and accompanying files.
Beckman, M. E., and Hirschberg, J. (1994) The ToBI annotation conventions. Online
Beckman, M., and Pierrehumbert, J. (1986) “Intonational structure in English and
Japanese”, Phonology Yearbook, 3, pp. 255–310.
Bertinetto, P. M. (1999) “Boundary strength and linguistic ecology”, Folia Linguistica,
33, pp. 267–286.
Bickel, B., Hildebrandt, K., and Schiering, R. (2009) “The distribution of phono-
logical word domains” in Grijzenhout, J., and Kabak, B. (eds.) Phonological
domains: Universals and deviations. Berlin: Mouton de Gruyter, pp. 47–75.
Booij, G. (1985) “Coordination reduction in complex words: A case for prosodic
phonology” in Hulst, H. van der, and Smith, N. (eds.) Advances in non-linear phon-
ology. Dordrecht: Foris, pp. 143–160.
Booij, G. (1996) “Cliticization as prosodic integration: The case of Dutch”, Linguistic
Review, 13, pp. 219–242.
Booij, G. (1999) “The role of the prosodic word in phonotactic generalizations” in
Hall, T. A., and Kleinhenz, U. (eds.) Studies on the phonological word. Philadelphia,
PA: John Benjamins, pp. 47–72.
Booij, G. (2007 [2005]) The grammar of words. Oxford: Oxford University Press.
Carter, R. T. Jr. (1974) Teton Dakota phonology. Ph.D. Diss., University of New
Mexico. (Published as University of Manitoba Anthropology Papers 10.)
Chomsky, N., and Halle, M. (1968) Sound pattern of English. Cambridge, MA:
MIT Press.
Cinque, G. (1993) “A null theory of phrase and compound stress”, Linguistic Inquiry,
24, pp. 239–297.
Czaykowska-Higgins, E., and Kinkade, M. D. (eds.) (1998) Salish languages and lin-
guistics: Theoretical and descriptive perspectives. Berlin: Mouton De Gruyter.
Dehé, N., Feldhausen, I., and Ishihara, S. (2011) “The prosody–syntax interface: Focus,
phrasing, language evolution”, Lingua, 121(13), pp. 163–169.
Dixon, R. M. W., and Aikhenvald, A. Y. (2002) “Word: A typological framework” in
Dixon, R. M.W., and Aikhenvald, A. Y. (eds.) Word. Cambridge: University Press,
pp. 1–41.
Downing, L. J. (1999) “Prosodic stem ≠ prosodic word in Bantu” in Hall, T. A., and
Kleinhenz, U. (eds.) Studies on the phonological word. Philadelphia, PA: John
Benjamins, pp. 73–98.
Elfner, E. (2012) Syntax-Prosody Interactions in Irish. PhD Dissertation. University of
Elordieta, G. (1997) “Accent, tone and intonation in Lekeitio Basque” in Martínez-
Gil, F., and Morales-Front, A. (eds.) Issues in the phonology and morphology
of the major Iberian languages. Washington, DC: Georgetown University Press,
pp. 4–78.
Elordieta, G. (2007) “Minimum size constraints on intermediate phrases” in
Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken,
pp. 1021–1024.
Gussenhoven, C. (2004) The phonology of tone and intonation. Cambridge: Cambridge
University Press.
Gussenhoven, C. (2005) “Procliticized phonological phrases in English: Evidence from
rhythm”, Studia Linguistica, 59, pp. 174–193.
Haider, H. (1993) Deutsche syntax, generativ. Tübingen: Gunter Narr.
Hall, T. A. (1999) “The phonological word: a review” in Hall, T. A., and Kleinhenz,
U. (eds.) Studies on the phonological word. Philadelphia, PA: John Benjamins,
pp. 1–22.
Hayes, B. (1989) “The prosodic hierarchy in meter” in Kiparsky, P., and Youmans, G.
(eds.) Rhythm and meter. Orlando, FL: Academic Press, pp. 201–260.
Kaisse, E. (1985) Connected speech: The interaction of syntax and phonology. New York:
Academic Press.
Kaisse, E. M., and Shaw, P. (1985) “On the theory of lexical phonology”, Phonology
Yearbook, 2, pp. 1–30.
Kanerva, J. (1990) “Focusing on phonological phrases in Chichewa” in Inkelas, S., and
Zec, D. (eds.) The phonology-syntax connection. Chicago: University of Chicago
Press, pp. 145–161.
Kisseberth, C. (1994) “On domains” in Cole, J., and Kisseberth, C. (eds.) Perspectives
in phonology. Stanford, CA: CSLI, pp. 133–166.
Klavans, J. (1982) Some problems in a theory of clitics. Ph.D. Diss., University College
Klavans, J. (1985) “The independence of syntax and phonology in cliticization”,
Language, 61, pp. 95–120.
Ladd, D. R. (1986) “Intonational phrasing: The case for recursive prosodic structure”,
Phonology, 3, pp. 311–340.
Ladd, D. R. (1996/2008) Intonational phonology. Cambridge Studies in Linguistics 79.
Cambridge: Cambridge University Press.
Loporcaro, M. (1999) “Teoria fonologica e ricerca empirica sull’italiano e i suoi
dialetti. Fonologia e morfologia dell’italiano e dei dialetti d’Italia” in Benincà, P.,
Mioni, A., and Vanelli, L. (eds.) Atti del 31º Congresso della Società di Linguistica
Italiana. Roma: Bulzoni, pp. 117–151.
McCarthy, J. J. (1979) “On stress and syllabification”, Linguistic Inquiry, 10, pp.
Nespor, M., and Vogel, I. (1982) “Prosodic domains of external sandhi rules” in Hulst,
H. van der, and Smith, N. (eds.) The structure of phonological representations.
Dordrecht: Foris, pp. 224–255.
Nespor, M., and Vogel, I. (1986/2007) Prosodic phonology. Dordrecht: Foris.
Odden, D. (1987) “Kimatuumbi phrasal phonology”, Phonology Yearbook, 4,
pp. 13–36.
Odden, D. (1996) The phonology and morphology of Kimatuumbi. The Phonology of
the World’s Languages. Oxford: Clarendon Press.
Odden, D. (2000) “The phrasal tonology of Zinza”, Journal of African Languages and
Linguistics, 21, pp. 45–75.
Patterson, T. A. (1990) Theoretical aspects of Dakota morphology and phonology.
Ph.D. Diss., University of Illinois Urbana-Champaign.
Peperkamp, S. (1997) Prosodic words. HIL Dissertation Series 34. The Hague: Holland
Academic Graphics.
Plag, I., Kunter, G., Lappe, S., and Braun, M. (2008) “The role of semantics, argument
structure, and lexicalization in compound stress assignment in English”, Language
84, pp. 760–794.
Scheer, T. (2010) A guide to morphosyntax-phonology theories: How extra-phonological
information is treated in phonology since Trubetzkoy’s Grenzsignale. Berlin: De
Gruyter Mouton.
Schiering, R., Bickel, B., and Hildebrandt, K. (2010) “The prosodic word is not uni-
versal, but emergent”, Journal of Linguistics, 46(3), pp. 657–709.
Schiering, R., Hildebrandt, K., and Bickel, B. (2007) “Cross-linguistic challenges
for the prosodic hierarchy: Evidence from word domains”. MS, University of
Selkirk, E. (1972) The phrase phonology of English and French. Outstanding
Dissertations in Linguistics. New York: Garland Publishing.
Vogel, I. (1977) The syllable in phonological theory: With special reference to Italian.
Ph.D. Diss., Stanford University.
Vogel, I. (1982) La Sillaba come Unità Fonologica. [The syllable as phonological unit].
Bologna: Zanichelli.
Vogel, I. (1986) “External sandhi rules operating between sentences” in Andersen, H.
(ed.) Sandhi phenomena in the languages of Europe. Berlin: Mouton de Gruyter,
pp. 55–64.
Vogel, I. (1999) “Subminimal constituents in prosodic phonology” in Hannahs, S. J.,
and Davenport, M. (eds.) Phonological structure. Dordrecht: Foris, pp. 251–269.
Vogel, I. (2008a) “The morphology-phonology interface: Isolating to polysynthetic
languages”, Acta Linguistica Hungarica, Special issue, 55(1), pp. 1–22.
Vogel, I. (2008b) “Universals of prosodic structure” in Scalise, S., Magni, E., Vineis,
E., and Bisetto, A. (eds.) Universals of language today. Amsterdam: Springer,
pp. 59–82.
Vogel, I. (2009) “The status of the Clitic Group” in Grijzenhout, J., and Kabak, B.
(eds.) Phonological domains: Universals and deviations. Berlin: Mouton de Gruyter,
pp. 15–46.
Vogel, I. (2010) “The phonology of compounding” in Scalise, S., and Vogel, I. (eds.)
Compounding: Theory and analysis. Amsterdam: John Benjamins, pp. 145–163.
Vogel, I. (2012) “Recursion in phonology?” in Bert, B., and Noske, R. (eds.)
Phonological explorations: Empirical, theoretical and diachronic issues. Berlin/
Boston: De Gruyter, pp. 41–61.
Vogel, I., and Raimy, E. (2002) “The acquisition of compound vs. phrasal stress in
English”, Journal of Child Language, 29(2), pp. 225–250.
Watson, J. C. E. (2002) The phonology and morphology of Arabic. Oxford: Oxford
University Press.
Watson, J. C. E. (2011) “Word stress in Arabic” in van Oostendorp, M., Ewen, C.
J., Hume, E. V., and Rice, K. (eds.) Blackwell companion to phonology, vol.
5. Oxford: Wiley-Blackwell, pp. 2990–3019.
Wiese, R. (1996) The phonology of German. Oxford: Clarendon Press.
Zwicky, A. (1984) “Clitics and particles”, Ohio State Working Papers in Linguistics,
29, pp. 148–173.
The Revised Max Onset
Syllabification and stress in English
San Duanmu
It is generally agreed that every syllable should have a possible onset and a
possible coda, to be specified shortly. Thus, no analysis proposes (2a), because
[kstr] is not a possible coda. Similarly, no analysis proposes (2e), because [kstr]
is not a possible onset. But opinions differ on how to create possible onsets
and codas, as seen in (2b)–(2d).
Analysis (2b) is proposed by Hoard (1971), based on two assumed
requirements: (i) the onset of a stressed syllable should be maximized (Max
Stressed Onset) and (ii) the coda should be maximized (Max Coda). In extra,
the second syllable has no stress, which means it need not maximize its onset,
and so the first syllable takes all the consonants it can as its coda, leaving only
/r/ to the second syllable. A similar analysis is proposed by Bailey (1978) and
Wells (1990).
Analysis (2c) is proposed by Lowenstamm (1981), who assumes that the
onset should be maximized for all syllables (Max Onset), plus the require-
ment that consonants in the onset should have increasing sonority. Following
Jespersen (1904), Lowenstamm assumes the sonority scale ‘vowel > glide >
sonorant > fricative > stop’, where a vowel has the greatest sonority and a
stop has the least. According to the scale, the sequence /st/ does not have
increasing sonority; therefore, /st/ cannot fit into an onset but must split
between two syllables, as shown.
Analysis (2d) is proposed by Pulgram (1970), who also assumes Max Onset for all
syllables but without the sonority requirement. Thus, the onset of the second
syllable is [str].
Let us consider another example. The English word whiskey /wɪski/ has four
proposed analyses, shown in (3). In (3c), [s] is ‘ambisyllabic’, which means it
belongs to both the first syllable and the second, so that the first syllable is
[wɪs] and the second is [ski].
Several comments are in order. First, the LOI applies to the ‘body’ of a syl-
lable, which includes the main vowel. This way the LOI can rule out syllables
like [sfæt] and [sfɛn] correctly, because no word starts with [sfæ] or [sfɛ]. If
the LOI only applied to the onset, then [sfæt] and [sfɛn] would satisfy the LOI
(contrary to native speakers’ intuition), because the onset [sf] is found
in sphere. Second, the LOF applies to the rime of the syllable, which includes
the main vowel. This way the LOF can rule out a syllable like [kæ], because
no word ends in the rime [æ]. If the LOF only applied to the coda, then a
syllable like [kæ] would satisfy the LOF, because it simply lacks a coda, and
many words end with no coda. Third, the LOI and the LOF apply to the sur-
face form of a word. For example, the surface form of Canada is [kænədə].
If we syllabify it as [kæn][ə][də], then both the LOI and the LOF are satis-
fied. However, if the LOF applies to the underlying form of Canada, which
according to Chomsky and Halle (1968) is [kænædə], where the first two
vowels are both [æ], then [kæn][æ][də] would violate the LOF, because the
second syllable ends in [æ], yet no English word does.
To illustrate the application of the LOI and the LOF, consider various
ways to syllabify the word extra, analyzed in (6). When the LOI or the LOF
is violated, an asterisk is shown. When the LOI or the LOF is satisfied, a
check mark is shown, and a sample word is given in parentheses, with relevant
sounds underlined.
(7) LOI and LOF in the syllabification of whiskey (in American English)
    Syllabification   LOF        LOI
    a. [wɪ][ski]      *          ✓ (scheme)
    b. [wɪsk][i]      ✓ (risk)   ✓ (east)
    c. [wɪ[s]ki]      ✓ (miss)   ✓ (keen)
    d. [wɪs][ki]      ✓ (miss)   ✓ (keen)
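The checks tabulated in (7) can be sketched in code. This is only a minimal illustration, assuming a toy lexicon made up of the five sample words cited in (7), broad transcriptions, and one character per segment; the function names and the vowel inventory are hypothetical and are not part of the original analysis.

```python
# Toy illustration of the LOI and LOF checks; all names here are hypothetical.
VOWELS = set("iɪeɛæəuʊoɔɑʌ")

# Broad transcriptions of the sample words in (7): scheme, risk, east, miss, keen.
LEXICON = ["skim", "rɪsk", "ist", "mɪs", "kin"]


def first_vowel(syllable):
    """Return the index of the main (first) vowel of a syllable."""
    for i, seg in enumerate(syllable):
        if seg in VOWELS:
            return i
    raise ValueError(f"no vowel in {syllable!r}")


def satisfies_loi(syllable, lexicon=LEXICON):
    """LOI: the body (onset plus main vowel) must begin some word."""
    body = syllable[: first_vowel(syllable) + 1]
    return any(word.startswith(body) for word in lexicon)


def satisfies_lof(syllable, lexicon=LEXICON):
    """LOF: the rime (main vowel plus coda) must end some word."""
    rime = syllable[first_vowel(syllable):]
    return any(word.endswith(rime) for word in lexicon)


# Check the juncture between the two syllables of whiskey, as in (7a), (7b), (7d).
for first, second in [("wɪ", "ski"), ("wɪsk", "i"), ("wɪs", "ki")]:
    print(first, second, satisfies_lof(first), satisfies_loi(second))
```

With a realistic lexicon the same two functions would be applied to every syllable of a candidate syllabification, not only to the juncture shown here.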
In Debra, the second syllable has no stress. For Hoard (1971), the coda of
the first syllable should be maximized, yielding [dɛb][rə], which satisfies both
the LOF and the LOI. For Lowenstamm (1981), [br] is a good onset, because
it has increasing sonority, yielding [dɛ][brə], where [dɛ] violates the LOF,
because no word ends in [ɛ]. Similarly, the analysis of Halle and Vergnaud
(1987) violates the LOF. Finally, the analysis of Pulgram (1970) satisfies both
the LOI and the LOF.
In essay, the second syllable has secondary stress. For Hoard (1971),
its onset should be maximized, yielding [ɛ][sei], where [ɛ] violates the LOF.
Similarly, the analyses of Lowenstamm (1981) and Halle and Vergnaud (1987)
violate the LOF. For Pulgram (1970), ‘possible rime’ requires the first syllable
to be [ɛs], yielding [ɛs][ei], which satisfies both the LOI and the LOF.
In summary, while all analyses assume some version of Max Onset, only
Pulgram’s version observes the LOF. Let us redefine the two versions in (10)
and call them Max Onset and Revised Max Onset.
Given the new definitions, Hoard (1971) assumes Max Onset for stressed
syllables and Max Coda otherwise. Lowenstamm (1981) assumes Max Onset,
with an additional requirement for a consonant sequence to have increasing
sonority in the onset. Halle and Vergnaud (1987) assume Max Onset. Finally,
Pulgram (1970) assumes Revised Max Onset, which also ensures that all rimes
are possible.
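The difference between the two versions can be illustrated with a small sketch. It assumes toy inventories of possible onsets and codas and treats a rime as possible unless it consists of a bare short vowel; the inventories, the function name `split_cluster`, and the tuple representation of clusters are illustrative assumptions, and the definitions in (10) remain the authoritative statement.

```python
# Hypothetical toy inventories; a real analysis would derive them from the lexicon.
SHORT_VOWELS = {"ɪ", "ʊ", "ɛ", "ʌ", "æ"}  # vowels that cannot end a word
LEGAL_ONSETS = {(), ("b",), ("k",), ("r",), ("s",), ("t",),
                ("b", "r"), ("s", "k"), ("s", "t"), ("t", "r"), ("s", "t", "r")}
LEGAL_CODAS = {(), ("b",), ("k",), ("s",), ("s", "k"), ("k", "s"), ("k", "s", "t")}


def split_cluster(vowel, cluster, revised):
    """Split an intervocalic cluster into (coda, onset), longest onset first.

    With revised=False only onset and coda legality are checked (Max Onset).
    With revised=True the preceding rime must also be a possible word-final
    rime, i.e. not a bare short vowel (Revised Max Onset).
    """
    cluster = tuple(cluster)
    for cut in range(len(cluster) + 1):  # cut = number of consonants in the coda
        coda, onset = cluster[:cut], cluster[cut:]
        if onset not in LEGAL_ONSETS or coda not in LEGAL_CODAS:
            continue
        if revised and not coda and vowel in SHORT_VOWELS:
            continue  # the rime would be a bare short vowel
        return coda, onset
    raise ValueError("no legal split")


# whiskey: Max Onset keeps [sk] together; Revised Max Onset moves [s] into the coda.
print(split_cluster("ɪ", ["s", "k"], revised=False))
print(split_cluster("ɪ", ["s", "k"], revised=True))
```

Under these assumptions the revised version yields [dɛb][rə] for Debra and [ɛs][ei] for essay, in line with the analyses attributed to Pulgram (1970) above.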
Let us now consider syllable weight, which is based on the length of the
rime. A syllable is light if the rime consists of a short vowel without a coda;
otherwise, the syllable is heavy. In English, a long vowel is one that can end
a stressed syllable. In American English, long vowels include [iː uː ei ou ai au
oi ɑː ɒː ɝː], as in see, two, day, go, buy, how, boy, spa, law, and fur respectively.
A short vowel is one that cannot end a stressed syllable, such as [ɪ ʊ ɛ ʌ], as in
sit, book, bed, and bud, or one that is unstressed only, such as [ə ɚ]. The vowel
[æ] is usually thought to be short as well (Chomsky and Halle 1968), although
it is phonetically long and does occur in some marginal words, such as nah
[næː]. Finally, unstressed word final [i u] are sometimes treated as short (Halle
and Vergnaud 1987). In (11) we summarize vowel length in American English.
Given the definition of syllable weight and vowel length, it is clear that
different ways of syllabification lead to different weight patterns. Consider
the word whiskey, whose syllabification and weight patterns are shown in (12).
For visual clarity, a hyphen is added between syllables in the columns under
Rime and Weight. In addition, H and L are shorthand notations for heavy
and light syllables respectively.
In (12), [ɪsk] and [ɪs] are both called heavy, although [ɪsk] has an extra con-
sonant. To distinguish them, VCC (such as [ɪsk]) and VVC (such as [aun] in
council) are sometimes called ‘super-heavy’, in contrast to VC and VV, which
are regular heavy. However, the distinction is of little consequence for our dis-
cussion and is not made here.
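The definition of weight can be restated as a short sketch. The vowel classes are taken from the lists given above for American English; the function names and the representation of a rime as a (vowel, coda) pair are assumptions made for illustration, and [æ] is placed with the short vowels, following the usual treatment mentioned above.

```python
# Vowel classes follow the lists given in the text for American English.
LONG_VOWELS = {"iː", "uː", "ei", "ou", "ai", "au", "oi", "ɑː", "ɒː", "ɝː"}
SHORT_VOWELS = {"ɪ", "ʊ", "ɛ", "ʌ", "æ", "ə", "ɚ"}


def syllable_weight(vowel, coda=""):
    """Return 'L' for a short vowel without a coda, otherwise 'H'."""
    if vowel not in LONG_VOWELS | SHORT_VOWELS:
        raise ValueError(f"unknown vowel: {vowel!r}")
    return "L" if vowel in SHORT_VOWELS and not coda else "H"


def weight_pattern(rimes):
    """Join the weights of a list of (vowel, coda) rimes with hyphens."""
    return "-".join(syllable_weight(v, c) for v, c in rimes)


# Canada syllabified as [kæ][nə][də] and as [kæn][ə][də].
print(weight_pattern([("æ", ""), ("ə", ""), ("ə", "")]))
print(weight_pattern([("æ", "n"), ("ə", ""), ("ə", "")]))
```

The sketch makes no super-heavy distinction, so [ɪsk] and [ɪs] both come out as H.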
(13) Main stress in English nouns (Halle and Vergnaud 1987: 227):
Main stress is on the penultimate syllable if it is heavy (e.g., agenda,
Else main stress is on the antepenultimate syllable (e.g., Canada, Mexico)
To obtain the proposed stress pattern, Halle and Vergnaud (1987) propose
an ordered set of rules, which we rephrase in (14), where H is a heavy syl-
lable, L is a light syllable, and parentheses over H or L indicate foot bound-
aries. A general assumption in metrical phonology is that every foot has stress
(either primary or secondary) and every stress implies a foot. In a trochaic
foot with two syllables, stress falls on the one on the left.
(14) Ordered rules for assigning main stress in English nouns (Halle and
Vergnaud 1987)
a. Syllabify according to Max Onset.
b. Exclude the final syllable (if the word has two or more syllables).
c. Build a trochaic foot from the right, which can be (H), (HL), or (LL).
d. Else build (L) instead.
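The ordered rules in (14) can be sketched as a small procedure over weight patterns. This is one possible reading, not Halle and Vergnaud's own formalization: it assumes that syllabification (14a) has already produced a list of H and L weights, and that the rightmost remaining syllable is inspected first when choosing among (H), (HL), and (LL). The function name is hypothetical.

```python
def main_stress(weights):
    """Return (index of the main-stressed syllable, foot built) for a noun.

    `weights` is a list of 'H'/'L' values obtained after syllabification (14a).
    """
    if not weights:
        raise ValueError("a word needs at least one syllable")
    # (14b) Exclude the final syllable if the word has two or more syllables.
    domain = weights[:-1] if len(weights) >= 2 else weights
    last = len(domain) - 1
    # (14c) Build a trochaic foot from the right: (H), (HL), or (LL).
    if domain[last] == "H":
        return last, "(H)"
    if last >= 1:
        return last - 1, "(" + domain[last - 1] + "L)"
    # (14d) Otherwise build (L).
    return last, "(L)"


print(main_stress(["L", "H", "L"]))  # agenda: heavy penult
print(main_stress(["L", "L", "L"]))  # Canada under Max Onset
print(main_stress(["L", "H"]))       # lemon: only one syllable remains
```

Exceptional words such as banana or Japan are not handled; in the rule-based analysis they require one of the steps to fail, which this sketch does not model.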
In (15) we show the analysis of some English nouns, both regular ones and
exceptional ones, where * indicates a violation of a rule in (14). Halle and
Vergnaud (1987) consider word final [i] to be short in some words, such as
city, which need not concern us.
The first six words are regular and the last four exceptional. In Tennessee
and Japan, (14b) fails to exclude the final syllable, which acquires main stress.
In banana, (14c) fails to build (LL); as a result, (14d) builds (L) instead. In tex-
tile, both syllables have stress, where the first has main stress and the second
has secondary stress. This means that (14b) fails to exclude the final syllable
(because excluded syllables cannot be assigned stress). In addition, (14c) fails
to assign main stress to the final syllable; instead, main stress appears on the
preceding syllable. It is worth noting, too, that although lemon and city are
thought to be regular words, their foot (L) is in fact exceptional, because it is
not among the preferred feet in the first step of foot construction (14c). We
shall return to this point.
Hayes (1995) offers a similar analysis, except that he only assumes two
regular foot types, (H) and (LL), each having two moras. His analysis is
rephrased in (16) and illustrated in (17).
(16) Rules for assigning main stress in English nouns (Hayes 1995)
a. Syllabify according to Max Onset.
b. Exclude the final syllable (if the word has two or more syllables).
c. Build a moraic trochee from the right, which can be (H), or (LL).
d. Else build (L) instead.
It is worth noting that there is no stressed L. This means that, unlike Halle
and Vergnaud (1987) and Hayes (1995), for whom (L.L) is a possible foot,
in the present analysis it is not. The present analysis agrees with two facts.
First, in Chinese, where syllable boundaries are clear, no L can carry stress
or tone. Second, in English no stressed final syllable is L, even though both
Halle and Vergnaud (1987) and Hayes (1995) allow L to be an exceptional
foot. Moreover, as we have seen above, while syllable boundaries are not
always obvious in English, Revised Max Onset can ensure that all stressed
syllables are H.
It is also worth noting that, in (HH), there is no stress clash, because at
the moraic level, the two stresses are separated by an unstressed mora. In
addition, by treating (HH) as a regular foot, we avoid a problem in previous
analyses. Specifically, in Halle and Vergnaud (1987), for words like alpine and
moron, main stress is assigned to the second syllable, and then a special rule
is used to shift the stress to the left. Similarly, Burzio (1994) has to make the
unusual claim that the second syllable in words like alpine and moron has no
secondary stress, contrary to many other people’s judgment. In the present
analysis, such words need no special treatment.
The proposed foot structures can be derived from two well-known
constraints, Foot Binarity and the Weight-Stress Principle, shown in (21),
along with Revised Max Onset, Parse2, Main Stress, and Null Beat, to
account for syllabification and word stress in English.
Foot Binarity requires a moraic foot to contain two moras and a syllabic
foot to contain two syllables (Prince 1980). The WSP has two parts. The first
part is similar to what Prince (1992) calls the Weight-to-Stress Principle, which
requires H to be stressed. The second part is similar to what Prince (1992)
calls the Stress-to-Weight Principle, which excludes (m.m) or (LL) from being
a possible foot, because there is a stressed L. Prince (1992) rejects the second
part of the WSP, in part because many English words, such as sanity, banana,
and city, seem to have a stressed L. However, as I have shown, the problem
arises from Max Onset. If we assume Revised Max Onset instead, then both
parts of the WSP can be maintained.
Parse2 requires every heavy syllable to form a moraic foot and have
stress, because it contains two moras (two moraic beats). In addition, Parse2
disallows two adjacent free syllables (two syllabic beats). On the other hand,
The analysis shows that the same CV string, such as CVCVCV in Canada
and banana, can satisfy the constraints in more than one way and yield more
than one good solution. It can be shown, too, that every English word has at
least one way to satisfy all the constraints.
The LOI, the LOF, the WSP, and FtBin have been discussed above. No
Marking aims to minimize exceptional or marked words. The evaluation of
various approaches to syllabification and stress assignment is shown in (24),
where HV refers to Halle and Vergnaud (1987).
As discussed above, Max Onset ignores the LOF, because it creates stressed
light syllables, such as the first syllable in Canada [kæ][nə][də] and very [vɛ][ri],
which are not found in word-final positions. In addition, such stressed light
syllables violate the WSP. In contrast, RMO always satisfies the LOI, the LOF,
and the WSP. There are two reasons. First, word-initial vowels are common,
which means that syllables without an onset can still satisfy the LOI. Second,
stressed word-final syllables are always heavy and satisfy the WSP, and consequently,
the LOF requires stressed non-final syllables to be syllabified in the
same way, which means they always satisfy the WSP, too.
Next we consider stress assignment and foot structure. First, in the deter-
ministic approach, both Halle and Vergnaud (1987) and Hayes (1995) assume
Max Onset, which violates the LOF and the WSP, as just discussed. In addition,
because they assume the exclusion of the final syllable, words like very and city
end up with just one short syllable, which is made into a foot by itself; this
violates FtBin, regardless of whether we assume moraic feet (Hayes 1995) or
syllabic feet (Halle and Vergnaud 1987). Finally, in the deterministic approach,
some words are regular and some exceptional, which violates No Marking.
Although Burzio (1994) assumes a non-deterministic approach, he assumes
Max Onset, too. Therefore, his analysis violates the LOF. In addition, to make
sure that words like city, very, and disco have a stressed heavy syllable, as
required by the foot (Hσ), these words have to be marked with an underlying
geminate consonant, which violates No Marking.
If we syllabify according to Max Onset, the LOF is violated by the first two
syllables. If we syllabify according to Max Coda, the LOF is satisfied, but the
second syllable causes a problem for stress assignment: It is H, yet it does not
attract stress. If we maximize the coda of the first syllable only (and maximize
the onset of other syllables), the second syllable still violates the LOF. In sum-
mary, given Chomsky and Halle’s analysis of underlying forms, if syllabifica-
tion precedes stress assignment, there is no way to satisfy the LOF, without
causing problems for stress assignment.
A solution is available if we give up the assumption that syllabification
precedes stress assignment, and assume instead that they can be evaluated
simultaneously. The solution is made possible in a constraint-based ana-
lysis (Prince and Smolensky 1993). For illustration, consider the analysis
of the string CVCVCV, which represents words like Canada, banana, Sicily,
committee, and so forth. Assuming the constraints discussed earlier, possible
syllabifications and foot structures of this string are shown in (27), where
Main refers to the requirement for main stress to fall on a syllabic foot.
(27) Possible analyses of CVCVCV: many good solutions and many bad ones
CVCVCV FtBin WSP RMO Parse2 Main
[CVC][ə][Cə] (HL)L ✓ ✓ ✓ ✓ ✓
[Cə][CVC][ə] L(HL) ✓ ✓ ✓ ✓ ✓
[CVC][ə][Cə] (mm)LL ✓ ✓ ✓ * *
*[Cə][CV][Cə] L(LL) ✓ * * ✓ ✓
*[CV][Cə][Cə] (LL)L ✓ * * ✓ ✓
*[CV][Cə][Cə] (L)LL * * * * *
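Because the constraints in (27) are treated as inviolable and are evaluated simultaneously, picking out the good analyses amounts to filtering for candidates that carry no violation mark. The following Python sketch transcribes tableau (27) and applies that filter. The names CONSTRAINTS, TABLEAU_27, violations, and winners are illustrative assumptions and not part of the analysis itself.

```python
# Tableau (27) as given in the text: True = satisfied (✓), False = violated (*).
CONSTRAINTS = ["FtBin", "WSP", "RMO", "Parse2", "Main"]

TABLEAU_27 = [
    # (syllabification, foot structure, marks in the order of CONSTRAINTS)
    ("[CVC][ə][Cə]", "(HL)L",  [True, True, True, True, True]),
    ("[Cə][CVC][ə]", "L(HL)",  [True, True, True, True, True]),
    ("[CVC][ə][Cə]", "(mm)LL", [True, True, True, False, False]),
    ("[Cə][CV][Cə]", "L(LL)",  [True, False, False, True, True]),
    ("[CV][Cə][Cə]", "(LL)L",  [True, False, False, True, True]),
    ("[CV][Cə][Cə]", "(L)LL",  [False, False, False, False, False]),
]


def violations(marks):
    """Return the names of the constraints that a candidate violates."""
    return [name for name, ok in zip(CONSTRAINTS, marks) if not ok]


def winners(tableau):
    """Return the foot structures of candidates that violate no constraint.

    No ranking is involved: every constraint must be satisfied at once.
    """
    return [feet for _, feet, marks in tableau if not violations(marks)]
```

Calling winners(TABLEAU_27) returns the two foot structures that are marked ✓ throughout, (HL)L and L(HL).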
Of the six options shown, only two satisfy all the constraints, represented
by Canada for (HL)L and banana for L(HL). The other four analyses violate
one or more of the constraints. It is worth noting that it is of little conse-
quence whether Canada has an underlying form [kænædə], as proposed by
Chomsky and Halle (1968), or whether it is simply [kænədə], as proposed by
Burzio (1996). Similarly, let us consider another string CVCCVV, shown in
(28), where VV is a long vowel or diphthong.
(28) Possible analyses of CVCCVV: many good solutions and many bad ones
CVCCVV FtBin WSP RMO Parse2 Main
[CVC][CVV] (HH) ✓ ✓ ✓ ✓ ✓
[Cə][CCVV]Ø L(HL) ✓ ✓ ✓ ✓ ✓
[CVC][CVV]Ø (mm)(HL) ✓ ✓ ✓ ✓ ✓
*[CVC][CVV] (mm)(mm) ✓ ✓ ✓ * *
*[CVCC][VV] (HH) ✓ ✓ * ✓ ✓
*[CV][CCVV] (LH) ✓ * * ✓ *
*[Cə][CCVV] L(mm) ✓ ✓ ✓ ✓ *
*[CVC][CVV]Ø H(HL) ✓ * * ✓ ✓
*[CVCC][VV]Ø (mm)(HL) ✓ ✓ * ✓ ✓
Of the various options, just three are good, (HH) as in disco, L(HL) as in
supply, and (mm)(HL) as in Bantu. Let us consider why other options are not
2.5 Conclusions
I have shown that Max Onset, a widely used rule for syllabification, satis-
fies the Law of Initials (LOI) but violates the Law of Finals (LOF). In con-
trast, the Revised Max Onset (RMO) satisfies both. In addition, Max Onset
creates stressed light syllables and violates the Weight-Stress Principle (WSP),
whereas RMO does not.
I have shown, too, that Max Onset is the only option in a derivational
approach to phonology (e.g., Halle and Vergnaud 1987), where a word under-
goes a set of ordered rules, first those for syllabification and then those for
stress assignment. In contrast, in a constraint-based approach to phonology,
where syllabification and stress assignment can be evaluated simultaneously,
RMO becomes possible.
I have also compared two approaches to stress assignment. In the deter-
ministic approach (e.g., Halle and Vergnaud 1987; Hayes 1995; Hammond
1999), some words are thought to be regular and others exceptional. In con-
trast, in the non-deterministic approach, all words are regular and no word
is exceptional. The non-deterministic approach is achieved by keeping the
constraints that are observed by all words, and leaving out the constraints
that are violated by ‘exceptional’ words. For example, in the deterministic
approach, there is a requirement to skip the final syllable, which is observed
by Canada but violated by Japan. In the non-deterministic approach, there is
no such requirement, and a word form can choose to skip the final syllable,
as Canada does, or keep it, as Japan does. Both approaches agree that English
word stress is not completely predictable and lexical markings are required.
In the deterministic approach, the markings indicate which words are regular
and which exceptional. In the non-deterministic approach, the markings indi-
cate which way a word chooses to satisfy the constraints.
The present analysis shows that some phonological constraints are much
stronger than previously thought. For example, RMO ensures that every
stressed syllable is heavy, which supports the second part of the WSP. That
is, not only must heavy syllables be stressed (a point Prince 1992 argues for),
but also light syllables must be unstressed (a point Prince 1992 believes to
be frequently violated). Similarly, in the deterministic approach, where the
final syllable is skipped, Canada has a binary foot, but banana does not. In
the present approach, both Canada and banana have a binary foot, and so do
all other words. Thus, contrary to a central claim in Optimality Theory that
all constraints are in principle violable (Prince and Smolensky 1993), some
constraints do not seem to be so. The present study intends to show that such
constraints merit greater attention than they have received.
Bailey, C.-J. N. (1978) Gradience in English syllabification and a revised concept of
unmarked syllabification. Bloomington: Indiana University Linguistics Club.
Burzio, L. (1994) Principles of English stress. Cambridge: Cambridge University Press.
Burzio, L. (1996) “Surface constraints versus underlying representation” in Durand,
J., and Laks, B. (eds.) Current trends in phonology: Models and methods, vol. 1.
Salford: European Studies Research Institute, University of Salford Publications,
pp. 123–141.
Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York: Harper
and Row.
Duanmu, S. (2007) The phonology of standard Chinese, 2nd edition. Oxford: Oxford
University Press.
Eddington, D., Treiman, R., and Elzinga, D. (2013) “Syllabification of American
English: Evidence from a large-scale experiment Part I”, Journal of Quantitative
Linguistics, 20(1), pp. 45–67.
Halle, M., and Vergnaud, J.-R. (1987) An essay on stress. Cambridge, MA: MIT Press.
Hammond, M. (1999) The phonology of English: A prosodic optimality theoretic
approach. Oxford: Oxford University Press.
Hayes, B. (1995) Metrical stress theory: Principles and case studies. Chicago: University
of Chicago Press.
Hoard, J. E. (1971) “Aspiration, tenseness, and syllabification in English”, Language,
47(1), pp. 133–140.
Jespersen, O. (1904) Lehrbuch der Phonetik. Leipzig; Berlin: Teubner.
Kahn, D. (1976) Syllable-based generalizations in English phonology. Ph.D. Diss.,
Massachusetts Institute of Technology.
3.1 Introduction
The Fuzhou dialect is the representative dialect of the Eastern Min dialect
group of Chinese. Fuzhou has a complex phonological system, and the com-
plexity lies in the fact that sound changes may occur to the initials, finals,2
and tones of all the participating syllables in a string of sounds (cf. Chen and
Norman 1965; Chan 1985; Chen 1998; Li 2002, among others). Before we
proceed to the discussion about enclitics and the clitic group composed of
“host+enclitic” in Fuzhou, let us first go over a brief introduction to Fuzhou
phonological phenomena relevant to the discussion in this chapter.
The first Fuzhou phonological phenomenon examined here is
Phonological Tone Sandhi (henceforth TS). TS stipulates that the citation
tone of a non-final syllable is changed into a sandhi tone depending on its
original tonal value and that of the tone of the following syllable within a
given domain (cf. Chen and Norman 1965; Chan 1980, 1985; Wright 1983;
Shih 1986; Hung 1987; Zhang 1992; Chan 1998; Chen 1998; You 2017,
among others). It has long been noticed that TS may apply to lexical items,
as in (1), and phrases consisting of independent words, as in (2). Citation
forms are presented on the left of “→”, while sandhi forms are presented
on the right. For the sake of brevity, only sandhi forms of tones (marked in
bold) are presented here.
82 Shuxiang You
From the examples in (1–8), we can find that in the domains formed by
lexical items and phrases in the Fuzhou dialect, both TS and CL apply in
some strings of sounds, while they are blocked in others. As we will see in
the following sections, the domain formed by the clitic group consisting of
“host+enclitic” is quite different from the domains formed by lexical items
and phrases, in terms of the application/blocking of TS and CL.
The following sections are organized as follows. Section 3.2 presents an
introduction to clitics in general and the clitic group in prosodic phonology.
Section 3.3 identifies Fuzhou enclitics and explores their morphosyntactic
functions. Section 3.4 examines the phonological properties of the clitic
group composed of “host+enclitic” in Fuzhou with respect to the application/
blocking of TS and CL. Section 3.5 discusses the violation of the Strict Layer
Hypothesis (SLH) caused by the clitic group consisting of “host+enclitic” in
Fuzhou. Section 3.6 concludes this study.
a hierarchically arranged organization called Prosodic Structure between the
morphosyntactic and phonological components. A given string of sounds is
organized into a series of hierarchically arranged prosodic constituents, with
each prosodic constituent serving as the domain of application for specific
phonological rules and phonetic processes. Thus phonological operations do
not refer to syntactic constituents in a direct way but instead to the already
created prosodic constituents. Hence, the existence of phonological rules and
phonetic processes that make reference to a particular prosodic constituent
is viewed as one significant motivation for the establishment of the prosodic
constituent itself in a given language.
The earliest prosodic hierarchy proposed by Selkirk (1978/1981) contains
only the syllable, the foot, the phonological/prosodic word,8 the phonological
phrase, the intonational phrase, and the utterance. Hayes (1984/1989) and
N&V (1986) added and inserted the clitic group between the phonological
word and the phonological phrase, and Zec (1988) proposed the mora (μ),
the lowest constituent in the hierarchy. Prosodic constituents are defined by
making use of different types of phonological and non-phonological infor-
mation. According to the types of information to which different constituents
are sensitive, Zhang (1992, 2017) proposed a trisected model for prosodic
hierarchy, as given in Figure 3.1.
The only well-formedness condition on prosodic constituency is laid down
in the SLH, formulated in Selkirk (1984), stipulating that in the prosodic
hierarchy, a prosodic constituent of a given level n immediately dominates
only constituents of the lower level n-1, and is exhaustively contained in a
constituent of the immediately higher level n+1. In responding to evidence
and criticisms that have challenged the SLH, Selkirk (1996) has factored out
the SLH into four more primitive constraints within the framework of the
Optimality Theory, as given in (9), among which Layeredness and Headedness
Definition of the clitic group and evidence for the clitic group domain
Based on the observation that certain phonological generalizations only apply
within the domain consisting of a word host and the clitic(s) in languages, the
string of the host plus the clitic(s) is treated as a unique prosodic constituent
in the prosodic hierarchy. This constituent is referred to as the clitic group, as
defined in (10).
Like other prosodic constituents, the clitic group has been reported to form
the domain for many phonological phenomena cross-linguistically, which
constitutes the most substantial evidence for the existence of this constituent.
A typical case is Stress Assignment in Latin. According to N&V (1986), the
clitic group is a domain for this rule. Specifically, when an enclitic is attached
to a word, the primary stress is shifted from its original position within the
word to the syllable that immediately precedes the clitic, as exemplified in (11),
in which -que ‘and’, interrogative -ne, and -cum ‘with’ are all enclitics.
(11) a. vírum ‘the man (acc.)’ virúmque ‘and the man (acc.)’
b. vídēs ‘you see’ vidḗsne? ‘do you see?’
c. cum vóbis ‘with you (pl.)’ vobíscum ‘with you (pl.)’
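The stress shift in (11) can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the host is supplied already divided into syllables (the divisions vi-rum, vi-dēs, and vo-bis are assumed here and are not taken from N&V), the enclitic is a single syllable, and the function names attach_enclitic and mark_stress are hypothetical.

```python
def attach_enclitic(host_syllables, enclitic):
    """Attach a monosyllabic enclitic to a host.

    Primary stress is placed on the syllable that immediately precedes
    the enclitic, that is, on the last syllable of the host.
    Returns the syllables of the clitic group and the stressed index.
    """
    if not host_syllables:
        raise ValueError("the host must contain at least one syllable")
    group = list(host_syllables) + [enclitic]
    stressed = len(host_syllables) - 1  # syllable right before the enclitic
    return group, stressed


def mark_stress(group, stressed):
    """Render the group with an apostrophe before the stressed syllable."""
    return "".join(("'" if i == stressed else "") + syl
                   for i, syl in enumerate(group))
```

For instance, mark_stress(*attach_enclitic(["vi", "rum"], "que")) yields vi'rumque, with stress on the syllable before -que, as in (11a).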
c. [[我]ω 其C]CG
[[ŋuai] ki]
In addition, similar to zhe 着, the durative aspect marker 𠲥 [lɛ0] can occur
between two verbs. In the “V1 𠲥 V2” construction, 𠲥 [lɛ0] attaches to the pre-
ceding verb (V1) and indicates that the event denoted by the following verb
(V2) happens in the state of “V1-ing”, as in (17). Moreover, 𠲥 [lɛ0] can be used
in an imperative sentence, as in (18).
c. 伊 会 买 [[卵糕]ω 𣍐C]CG ?
ʔi ʔa mɛ [[louŋ ko] ma]
he will buy cake Qu
‘Will he buy a cake?’
c. [[等]ω 遘C]CG 十 点
[[tiŋ] kau] seiʔ teiŋ
wait PVP ten o’clock
‘to wait until ten o’clock’
d. [[做]ω 遘C]CG 逢侬 都 满意
[[tso] kau] xuŋ nøyŋ tu muaŋ ʔei
do PVP everyone all satisfied
‘to do (something) and make everyone satisfied’
3.3.8 Summary
From data presented in Section 3.3, we can find that these enclitic-like elem-
ents in the Fuzhou dialect share some of the most common morphosyntactic
properties of enclitics across languages: (a) they all belong to functional cat-
egories; (b) they never occur as the only element of an utterance and must
attach to the adjacent prosodic unit (ω or CG) on the left as the host; (c) the
meaning of the string of the host plus the enclitic is predictable from the
meaning of the host and that of the enclitic; and (d) they can attach to
material already containing the affix, as in (12b) and (29d), or the clitic, as in
(23) and (30). Therefore, according to the discussion in Section 3.2.1, it is rea-
sonable to consider these elements as enclitics. The group of the host plus the
enclitic thus forms a type of clitic group in this dialect. In Section 3.4, we will
see that there are phonological phenomena characteristic only of such a type
of clitic group in Fuzhou, which provides further evidence for the existence
of enclitics and the clitic group consisting of “host+enclitic” in this dialect.
We can find that in (31a), TS applies between 旧 ‘old’ and 书 ‘book’ and
changes the tone of 旧 ‘old’, while TS is blocked in (31b), although these two
examples have similar morphosyntactic structure, namely the modifier-head
Some linguists suggest that the blocking of TS in cases like (31b) can be
ascribed to the neutral tone carried by elements like 其 [ki0] (e.g., Chan 1985;
Li 2002; among others). Nevertheless, notice that 囇 [la242] in (31c) bears a non-
neutral tone but also causes the blocking of TS, showing that the blocking of
TS cannot be simply ascribed to the tonal value.
Elements like 其 [ki0] and 囇 [la242] that can trigger the blocking of TS are
enclitics, according to the discussion in Section 3.3. Hence I assume that the
clitic group composed of “host+enclitic” in Fuzhou cannot form the domain of
application for TS. Specifically, TS is blocked between the host and the enclitic.
This assumption is well supported by Fuzhou data, as illustrated in (32–38).
*touŋ 51
[[ʔy51] lau31]
fall rain CRS
‘It is raining.’
VI. Host+delimitative aspect marker 囇 [la242]
a. [[坐]ω 囇C]CG
[[soy242] la242]
→ [[soy ] # la242]
*[[soy51] la242]
sit DLM
‘to sit awhile’
(39) a. 买 锅
        mɛ31 kuo44
        → mɛ21 # kuo44
        *mɛ21 ʔuo44
        buy pan
        ‘to buy a pan’
     b. 买 过
        mɛ31 kuo213
        → mɛ31 ʔuo213
        buy EXP
        ‘to have bought (something)’
3.4.3 Summary
The application/blocking of TS and CL in lexical items, phrases, and the
clitic group domain composed of “host+enclitic” in Fuzhou is summarized
in Table 3.1. ‘√’ denotes the application of the rule, while ‘×’ indicates that the
rule is blocked even though there is an appropriate environment. ‘√/×’ signifies
that the application is not obligatory.
3.6 Conclusion
Based on the discussions in previous studies on clitics and the clitic group
across languages, this study presents a thorough investigation of enclitics and
the clitic group consisting of “host+enclitic” in Fuzhou, from the perspectives
of morphosyntactic functions and phonological behavior. The following
properties of enclitics and the clitic group consisting of “host+enclitic” in
Fuzhou have been identified:
Thus we can find that, on the one hand, enclitic-like elements in Fuzhou
reported in the literature are indeed enclitics, since they share common prop-
erties with enclitics in other languages. On the other hand, the group of
1 I would like to thank Prof. Hongming Zhang, who has persistently encouraged
and pushed me during my writing of this chapter. His valuable comments and
suggestions have greatly improved the quality of this chapter. An earlier version of
this chapter was presented at the 24th Columbia University Graduate Conference
on East Asia in New York, NY, February 2015. Thanks go to the audience at the
conference for various questions and comments. I would also like to thank my
informants, Mr. Dexing Chen, Mrs. Ling Chen, and Mrs. Liping Song, for their
patience and support during my fieldwork in Fuzhou in 2016 and their suggestions
and comments despite the physical distance ever since 2015. Of course, any
remaining errors in this chapter are mine.
2 The sound change affecting finals in Fuzhou is a tonally conditioned phonological process:
it occurs in cases where tone sandhi occurs and is blocked whenever tone sandhi
is blocked (cf. Chen and Norman 1965; Chan 1985; Chen 1998, among others).
I assume that the domain of application for sound changes to finals should be the
same as the domain of application for tone sandhi. Hence, for the sake of brevity,
sound changes to finals will not be presented and discussed in this chapter.
3 The tone sandhi behavior of lexical items formed through reduplication like
(3) is conditioned by another Fuzhou rule, which is referred to as Morphological
Tone Sandhi in You (2017). Please see Chen and Norman (1965), Chen (1998),
You (2017), among others, for more details.
4 The complex tone sandhi behavior of phrasal-level constructions exhibited by the con-
trast between (2) and (4) has long been a problem for linguists. Readers are referred
to Chen and Norman (1965), Chan (1980, 1985), Wright (1983), Shih (1986), Hung
(1987), Zhang (1992), Chan (1998), You (2017), among others, for different analyses.
5 For detailed discussion on the blocking of CL in lexical items like (7), please see
You (2017).
6 The application/blocking of CL in phrasal-level constructions is another long-
standing problem for linguists. Please see Chen and Norman (1965), Chan (1985),
Shih (1986), You (2017), among others, for different analyses.
Booij, G. (1983) “Principles and parameters in prosodic phonology”, Linguistics,
21(1), pp. 249–280.
Booij, G. (1985) “The interaction of phonology and morphology in prosodic phon-
ology” in Gussmann, E. (ed.) Phono-morphology: Studies in the interaction of phon-
ology and morphology. Lublin: Katolicki Universytet Lubelski, pp. 23–34.
Booij, G. (1996) “Cliticization as prosodic integration: The case of Dutch”, Linguistic
Review, 13(3–4), pp. 219–242.
Chan, L.-L. L. (1998) Fuzhou tone sandhi. Ph.D. Diss., University of California
San Diego.
Chan, M. K.-M. (1980) Syntax and phonology interface: The case of tone sandhi in
the Fuzhou dialect of Chinese. MS Thesis, University of Washington.
Chan, M. K.-M. (1985) Fuzhou phonology: A non-linear analysis of tone and stress.
Ph.D. Diss., University of Washington.
Chen, L., and Norman, J. (1965) An introduction to the Foochow dialect. San
Francisco: San Francisco State College.
Chen, M. Y. (1985) The syntax of Xiamen tone sandhi. MS, University of California
San Diego.
Chen, M. Y. (1987) “The syntax of Xiamen tone sandhi”, Phonology Yearbook, 4, pp.
Chen, Z.-P. (1998) Fuzhou Fangyan Yanjiu [A study of the Fuzhou dialect].
Fuzhou: Fujian People’s Publishing.
Part II
Prosodic patterns
Geographical clines in the realization
of intonation in the Netherlands
Judith Hanssen, Carlos Gussenhoven,
and Jörg Peters
4.1 Introduction
Geography is one of the explanatory factors of phonetic variation in speech
(cf. Britain 2013). The realization of intonation contours has recently been
shown to follow a geographical cline from the southwest to the northeast
of the Netherlands, with a continuation to the low Saxon dialect of Weener
across the border in Germany (Peters et al. 2014, 2015). Earlier, Gilles
(2005: 165) suggested that pitch excursions of f0 falls in varieties of German
are larger in the west than in the east of Germany, on the basis of limited
data. In these cases, the variation concerns realizational differences in Ladd’s
(2008: 116) terms, that is, differences in the phonetic realization of compar-
able phonological forms.
The realization of intonation contours may differ in ways more general
than those conditioned by contextual factors, like the segmental composition of the
accented syllable, upcoming word boundaries, or focus. First, peak timing
varies across languages and varieties: peaks are earlier in English than in Dutch and German, and later in
southern German than in northern German (Atterer and Ladd 2004; Ladd
et al. 2009; Mücke et al. 2009), while Kügler (2007) reported later f0 peaks in
the southern Swabian variety of German than in the eastern Upper Saxon
variety. Dialectal variation in tonal timing has also been reported for var-
ieties of Lowland Scots1 (van Leyden 2004), German (Peters 1999; Gilles
2005), American English (Arvaniti and Garding 2007), Irish (Kalaldeh et al.
2009), and British English (Ladd et al. 2009). Second, pitch excursion size and
overall pitch level equally show regional variation. Belgian women speak at a
higher pitch than Dutch women (van Bezooijen 1993), and Gilles (2005: 165)
reported variation in f0 excursion size of falling contours between speakers of
eight varieties of German. Ulbrich (2005) reported differences in pitch range
between speakers of two standard varieties of German (Swiss and Northern
German).2 Finally, the dialects spoken on the Orkney and Shetland islands
differ in overall pitch level, with intonation contours in the Orkney variety
being realized at a higher pitch (van Leyden 2004).
For Dutch, dialectal characteristics had until recently only been described
informally (van Es 1935; Daan 1938; Weijnen 1966). Two studies have
4.2 Procedure
4.2.1 Materials
We used three sets of sentences. The first set contained four declarative narrow-
focus carrier sentences with a non-final falling pitch accent (nf-FALL); the
second set contained four declarative narrow-focus carrier sentences with an
IP-final falling pitch accent (f-FALL); and the last set contained four rhet-
orical questions with an IP-final falling-rising pitch accent (f-FR). All 12 car-
rier sentences (labeled “B”) were preceded by a context sentence (“A”) with
which they formed a mini-dialogue, as illustrated in Table 4.1. In the non-final
declaratives, the target words consisted of fictitious place names, Momberen,
Memberen, Manderen, Munderen,3 which had the metrical pattern sww, in
which the segmental structure of the accentable first syllable was Nasal-V-
Nasal, followed by a voiced plosive onset consonant. They were followed by
a sequence of two sw verbs. In the carrier sentences for the accentable IP-final
position, four fictitious monosyllabic proper names, Lof, Loof, Lom,
Loom, were used as target words in each pragmatic condition. These varied
in the rime only, where short [ɔ] and long [oː] combined with voiceless [f] and
sonorant [m].
Note: The target sentences are printed in bold; the word carrying the nuclear pitch accent is
A slightly modified version of these sentences was used to collect the
Standard Dutch data. The sentences shown in Table 4.1 were used for
Zuid-Beveland, Rotterdam, and Amsterdam. Speakers from Zuid-Beveland
translated the sentences into their variety as they spoke. We translated the
sentences into the local language for speakers of West Frisian and Low
Saxon, which have standardized spelling systems. For all varieties, the
rhythmic, lexical, and segmental contexts were comparable to the Standard
Dutch materials. A list of the sentences in all language versions is given in the
a Scouting club (RO, AM), or members of the local community (GR, WI).
The speakers from Zuid-Beveland, Grou, and Winschoten were bilingual with
Standard Dutch and their local language. All regional speakers and at least
one of their parents were raised in the selected place and spoke the indigenous
variety fluently. For Standard Dutch, the procedure was different, as the area
where this variety is spoken is less determined by geographical boundaries.
Speakers could participate if they reported speaking Standard Dutch. Besides
self-reporting, two Dutch phoneticians independently judged each recording.
Recordings were included if the judges agreed that the geographical and lin-
guistic origin of the participants could not be determined by their accent.
Except for the speakers of West Frisian and Standard Dutch, our speakers
were less familiar with their local language as a written language, which may
have had a negative influence on the fluency of the speech in the reading task
of some speakers.
Participants’ recordings were excluded if they were (highly) disfluent or
appeared to the experimenter not to speak naturally; if the speakers afterward
reported that they were dyslexic or had hearing problems; or if the speakers
turned out not to satisfy the requirements with respect to their linguistic or
geographical background. All participants were naive as to the purpose of the
task and were paid for their participation.
Table 4.2 Number of speakers used in the analyses, broken down by variety, sentence
condition, and gender
          nf-FALL            f-FALL             f-FR
Variety   F    M    Total    F    M    Total    F    M    Total
SD        13   8    21       9    8    17       13   9    22
ZB        7    10   17       7    8    15       6    2    8
RO        7    12   19       3    10   13       7    8    15
AM        7    11   18       4    2    6        6    6    12
GR        20   3    23       18   3    21       20   2    22
WI        13   4    17       12   4    16       13   2    15
Table 4.4 Acoustic variables used in the comparison of non-final and final nuclear
contours in five varieties
Variable | Description | Formula | nf-FALL | f-FALL | f-FR
Durational variables
RimeDuration | the duration of the sonorant rime of the nuclear syllable in ms | t(O2) – t(N1) | ✓ | ✓ | ✓
Timing variables
H-RelTiming | the timing of H as a proportion of the sonorant rime duration in % | (t(H) – t(N1)) / (t(O2) – t(N1)) * 100 | ✓ | ✓ | ✓
Scaling variables
H-Scaling | the height of the nuclear peak in ST re 100 Hz | f(H) | ✓ | ✓ | ✓
L-Scaling | the height of the elbow following the nuclear peak in ST re 100 Hz | f(L) | ✓ | ✓ | ✓
H2-Scaling | the height of the final boundary tone in fall-rises in ST re 100 Hz | f(H2) | | | ✓
Contour shape variables
FallDuration | the duration of the fall following the nuclear peak in ms | t(L) – t(H) | ✓ | ✓ | ✓
FallExcursion | the excursion of the fall following the nuclear peak in ST | f(L) – f(H) | ✓ | ✓ | ✓
FallSlope | the rate of change of the fall following the nuclear peak in ST/s | FallExcursion / FallDuration * 1000 | ✓ | ✓ | ✓
RiseDuration | the duration of the final rise in fall-rises in ms | t(H2) – t(L) | | | ✓
RiseExcursion | the excursion of the final rise in fall-rises in ST | f(H2) – f(L) | | | ✓
RiseSlope | the rate of change of final rise in fall-rises in ST/s | RiseExcursion / RiseDuration * 1000 | | | ✓
RatioFRDur | relation between duration of falling and rising part of fall-rise | FallDuration / RiseDuration | | | ✓
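The formulas in Table 4.4 can be sketched in Python as below. The sketch assumes that landmark times are given in ms and f0 values in semitones, as the table states. The class name Landmarks, its field names, and the reading of N1 and O2 as the start and end of the sonorant rime (inferred from the RimeDuration formula) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Landmarks:
    """Hypothetical container for one nuclear contour.

    Times (t_*) are in ms; f0 values (f_*) are in semitones re 100 Hz.
    """
    t_n1: float  # assumed: start of the sonorant rime
    t_o2: float  # assumed: end of the sonorant rime
    t_h: float   # time of the nuclear peak (H)
    f_h: float   # height of the nuclear peak
    t_l: float   # time of the elbow following the peak (L)
    f_l: float   # height of the elbow
    t_h2: Optional[float] = None  # final high target, fall-rises only
    f_h2: Optional[float] = None


def fall_measures(m):
    """Variables of Table 4.4 that apply to all three sentence conditions."""
    rime_duration = m.t_o2 - m.t_n1
    fall_duration = m.t_l - m.t_h
    fall_excursion = m.f_l - m.f_h
    return {
        "RimeDuration": rime_duration,
        "H-RelTiming": (m.t_h - m.t_n1) / rime_duration * 100,
        "FallDuration": fall_duration,
        "FallExcursion": fall_excursion,
        "FallSlope": fall_excursion / fall_duration * 1000,  # ST/ms to ST/s
    }


def rise_measures(m):
    """Variables of Table 4.4 that apply to fall-rises only."""
    if m.t_h2 is None or m.f_h2 is None:
        raise ValueError("H2 is required for fall-rise measures")
    rise_duration = m.t_h2 - m.t_l
    rise_excursion = m.f_h2 - m.f_l
    return {
        "RiseDuration": rise_duration,
        "RiseExcursion": rise_excursion,
        "RiseSlope": rise_excursion / rise_duration * 1000,  # ST/ms to ST/s
        "RatioFRDur": (m.t_l - m.t_h) / rise_duration,
    }
```

Note that FallExcursion comes out negative for a falling movement, because the table defines it as f(L) – f(H); FallSlope is therefore negative as well.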
Figure 4.2 Mean sonorant rime duration in non-final falls, final falls, and final
fall-rises for each variety. Error bars represent ±2 standard errors of
the mean
Since female speakers on average speak at a higher pitch level than male
speakers (225 Hz vs. 125 Hz), we measured f0 in semitones. This will to a large
extent normalize gender variation where excursion sizes are concerned, but
will not normalize differences in the scaling of individual pitch targets (such
as the scaling of the nuclear peak). The effects of Dialect on tonal scaling
(H-Scaling, L-Scaling, H2-Scaling) will therefore be reported for the largest
gender group only, female speakers.
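The effect of the semitone scale can be illustrated with a short Python sketch. The conversion 12 * log2(f / reference) is the standard definition of an interval in semitones. The function name hz_to_st and all example frequencies other than the 225 Hz and 125 Hz means cited above are invented for illustration.

```python
import math


def hz_to_st(f_hz, ref_hz=100.0):
    """Convert a frequency in Hz to semitones relative to ref_hz."""
    return 12 * math.log2(f_hz / ref_hz)


# Two invented falls with the same frequency ratio (5:4) at different pitch levels.
high_voice_fall = hz_to_st(200) - hz_to_st(250)
low_voice_fall = hz_to_st(100) - hz_to_st(125)

# The mean pitch levels cited in the text, compared on the semitone scale.
level_gap = hz_to_st(225) - hz_to_st(125)
```

The two excursions come out equal (a fall of about 3.9 ST each), whereas the gap between the two mean levels remains at roughly 10 ST. This is why the conversion normalizes excursion sizes but not the scaling of individual targets.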
4.3 Results
nf-FALL **
SD vs. RO ZB vs. AM RO vs. GR AM vs. WI
nf-FALL *
SD vs. AM ZB vs. GR RO vs. WI
nf-FALL **
SD vs. GR ZB vs. WI
nf-FALL **
f-FALL *
SD vs. WI
nf-FALL ***
f-FR *
Figure 4.3 Mean proportional peak timing in non-final falls, final falls, and final fall-
rises for each variety. Error bars represent ±2 standard errors of the mean
[Figure with two panels: (a) H1_Scaling and (b) L2_Scaling]
nf-FALLS, AM peaks are significantly later than all other varieties, with
mean differences ranging from 18 (AM-RO) to 29 percent (AM-WI).
Figure 4.5 Mean scaling in semitones of H, L, and H2 in f-FR for each variety. Error
bars represent ±2 standard errors of the mean
by f-FR (16.7 ST) and f-FALLS (15.9 ST). Although peaks are always highest
in non-final falls, they are not always lowest in final falls, which may have
caused the interaction.
As for scaling of the low target (L), we found a main effect of Dialect
[F(5,64) = 2.97, p<.05], of Sentence_condition [F(2,13) = 152.40, p<.001],
and a Dialect * Sentence_condition interaction [F(10,592) = 4.83,
p<.001]. Post-hoc tests revealed no significant differences in L-Scaling
between any of the dialects, but showed that L in fall-rises was significantly
higher (13.1 ST) than both non-final (10.3 ST) and final falls (9.6 ST) at
Separate analyses for final and non-final falls showed that H-Scaling
was significantly affected by Dialect in f-FALLS [F(5,48) = 2.71, p<.05],
although none of the varieties differed significantly in post-hoc tests. In f-FR,
Dialect did not significantly affect H-Scaling, but as Table 4.9 shows, we
did find a main effect of Dialect on L-Scaling and H2-Scaling.
Table 4.9 Effects of Dialect on the scaling of the nuclear peak, the elbow, and the
final high target in final fall-rises
Figure 4.6 Mean f0 duration in ms (left panel), f0 excursion in ST (center panel) and f0 slope in ST/s (right panel) for non-final and final
falls, broken down by dialect. Error bars represent ±2 standard errors of the mean
nf-FALL F(5,107) = 24.10 p<.001
f-FALL F(5,76) = 3.18 p<.05
nf-FALL F(5,103) = 6.51 p<.001
f-FALL F(5,76) = 4.52 p<.001
nf-FALL F(5,105) = 8.45 p<.001
f-FALL F(5,77) = 5.19 p<.001
Fall-rises
Regional differences in the shape of the fall-rise may be due to differences in
the shape of the falling movement (H* to L), of the final rise (L to H%), or
both. We therefore measured pitch movement duration, excursion, and slope
separately for the falling movement (FallDuration, FallExcursion,
and FallSlope) and the final rise (RiseDuration, RiseExcursion, and
RiseSlope). The bar charts in Figure 4.7 show that there are rather large
differences in the shape of the fall-rise, with ZB, GR, and WI differing most
from the other varieties, each in its own way. For the falling movement, WI
has the longest duration, largest excursion, and steepest slope. ZB, on the
other hand, has the shortest duration, smallest excursion, and shallowest
slope for both the falling and the rising movement. In GR, the difference
between the falling and rising part is small in terms of duration, excursion,
and slope. These examples illustrate that regional patterns vary for the falling
movement, the rising movement, and the relation between the two.
We found a main effect of Dialect on FallDuration [F(5,83) = 2.66,
p<.05] and RiseSlope [F(5,83) = 2.94, p<.05]. We also found a main effect of
Gender on RiseDuration [F(1,85) = 4.69, p<.05] and a Gender*Dialect
interaction [F(1,84) = 2.42, p<.05] for FallExcursion.
Figure 4.7 Mean f0 duration in ms (left panel), f0 excursion in ST (center panel), and slope in ST/s (right panel) of the falling (FR1)
and rising (FR2) movements of final fall-rises. Error bars represent ±2 standard errors of the mean
Figure 4.8 Duration ratio, excursion ratio, and slope ratio between the falling and
rising movement of f-FR
notably in non-final falls, which may partially be due to the larger numbers
of women in the GR and WI groups. In final falls, male speakers showed
a steeper falling slope, though not consistently across all dialects. Finally,
women produced longer, and in some dialects also larger, final rises in H*L
H% nuclear accents. These features might be explained as an enhanced use of
the high-pitched end of Ohala’s (1983) Frequency Code (Gussenhoven 2016).
Duration
Sonorant rime durations gradually increased from the southwest (ZB) to the
northeast (GR, WI), showing a weak geographical component. ZB generally
had the shortest durations, and WI the longest, matching the first half of
the inverted U-shape reported in Peters et al. (2014). The short segmental
durations for ZB and the long ones in GR are in agreement with that study,
which investigated the effects of focus condition on the realization of non-
final declarative falls in varieties of Dutch (but which did not look at Standard
Recall from Section 4.2.3 that speakers from WI often pronounced the
target words, for example, “Manderen” as a disyllabic word, [mɑndə(ː)n]
instead of [mɑndərə]. This reduction might partly explain the long rime
durations, since the fewer unstressed syllables that occur after the main stress
in a word, the longer its stressed syllable (Nooteboom 1972; Rietveld et al.
2004). However, because in the case of final falls and final fall-rises seg-
mental duration was always longest in WI for identically pronounced target
words, speakers of WI may safely be said to have longer segmental durations
Scaling
Across sentence conditions, Dialect did not systematically affect the scaling
of nuclear peaks or overall pitch level, as was found between the standard
varieties of Dutch in the Netherlands and Belgium by van Bezooijen (1993).
The largest effect of Dialect on scaling was found in the fall-rises, where
the valley between the two high tones was realized at much higher f0 in ZB
and RO than in AM and WI. The final high tone was scaled highest in WI.
Differences in the depth of the valley and height of the final high tone have
consequences for the excursion sizes of the falling and rising movements of
the fall-rise, as Figure 4.7 suggests.
[Figure panel labels: ZB, Central, WI]
replicated in our findings, because fall slope for ZB is shallow in our results,
but steep in theirs. Apart from that difference, the dialects show similar
behavior in the two studies. Figure 9 provides schematic representations of
the shape of non-final and final falls.
The fall-rise was rather similar in shape in SD, RO, AM, and GR (see
Figure 4.7), the “central” realization. ZB and WI deviated
from it in their own ways. Speakers of WI realized it with longer, larger, and
steeper falling movements and shorter and shallower final rises than speakers
of the other varieties. As a result, the ratio between the falling and rising
movements in WI differed substantially from that in the other varieties, in
which the shapes of the falling and rising movements were comparable (GR)
or the fall was short, small, and shallow compared to the final rise. The
stylized contours in Figure 4.9 show the distinct shape of the fall-rise in WI
in comparison with the “central” version and ZB. As shown in Figure 4.10,
the shape of the ZB fall-rise is characterized by a shallow dip between the two
high peaks. In Hanssen (2017: 149), it is shown that speakers of ZB often do
not produce such a dip at all. These extremely shallow realizations by speakers
of Zuid-Beveland represent a context-specific response to time pressure that is
absent in the other varieties.
Significantly, the most extreme realizational variation could be observed in
the two geographically most extreme dialects, Zuid-Beveland and Winschoten.
They differed most from each other as well as from the other varieties. In fact,
if we look at the significantly different dialect pairs, 67 percent of them (29
out of 43) involve ZB, WI, or both. In only 14 out of 43 cases are dialects
other than ZB and WI involved in the comparison. Thus, Zuid-Beveland and
1 The differences in tonal timing between the Lowland Scots varieties of Orkney
and Shetland have an additional effect on syllable duration, which is longer for the
Shetland variety. Van Leyden (2004: 69) attributes the longer syllable duration to
the fact that in Shetland, the entire rising movement is realized on the accented
syllable, while in Orkney the peak of the rise is not realized until after the accented
2 In addition, Ulbrich found differences in overall speech rate between the two var-
ieties, which were caused by the number and duration of pauses within sentences.
3 Speakers of Standard Dutch produced three sentences each, with the target words
Manderen, Bunderen, and Lunteren.
4 This point roughly corresponds to the location in the f0 curve where the first deriva-
tive (or slope) is zero. As such, our method is the manual version of the MOMEL
algorithm used as input for the INTSINT transcription system for the representa-
tion of intonation (Hirst 2005).
5 Verhoeven et al.’s work was criticized for methodological errors in Quené (2008),
who nevertheless reached the same conclusion regarding differences in speech
tempo between speakers from Flanders and the Netherlands, and between male
and female speakers.
A context sentence
B carrier sentence, target word in bold (only for Dutch)
i Dutch (Zuid-Beveland, Rotterdam, Amsterdam)
ii West Frisian (Grou)
iii Low Saxon (Winschoten)
iv English gloss
v English translation
vi Standard Dutch pilot
Sentence 2
Ai Waar zouden de Janssens heen willen lopen?
ii Wêr soenen de Janssens hinne rinne wolle?
iii Woar zollen de Janssens hinlopen willen?
iv where would the johnsons to want walk /walk want
v Whereto would the Johnsons like to walk?
Sentence 3
Ai Waar zou Karel je heen willen brengen?
ii Wêr soe Karel dy hinne bringe wolle?
iii Woar zol Karel die hinbringen willen?
iv where would Karel you to want bring /bring want
v Where did Karel want to take you to?
Sentence 2
Ai Van wie is dat dikke boek?
ii Fan wa is dat tsjûke boek?
iii Van wel is dat dikke bouk?
iv of who is that big book
v Whose big book is that?
Sentence 4
Ai Met wie gaat je baas morgen trouwen?
ii Mei wa sil dyn baas moarn trouwe?
iii Mit wel gaait dien boas mörgen traauwen?
iv with who goes your boss tomorrow marry
v Who will your boss marry tomorrow?
Sentence 2
Ai Meester Boelens gaat mee op schoolreis.
ii Master Boelens sil mei op skoalreis.
iii Meester Boelens gaait mit op schoulraaise.
iv Mr Boelens goes along on schooltrip
v Mister Boelens is coming along on the school trip.
Sentence 3
Ai Pepijn de Heer komt straks ook naar ’t feest.
ii Pepijn de Heer komt strak ek nei ’t feest.
iii Pepijn de Heer komt straks ook noar ’t feest.
iv Pepijn de Heer comes later also to the party
v Pepijn de Heer is also coming to the party later.
Sentence 4
Ai Dit antieke horloge is nog van opa Thijssen geweest.
ii Dit antike horloazje hat noch fan pake Thijssen west.
iii Dit antieke hallozie is nog van opa Thijssen west.
iv this antique wristwatch is even of grandfather Thijssen been
v This antique wristwatch used to belong to grandfather Thijssen.
Arvaniti, A., and Garding, G. (2007) “Dialectal variation in the rising accents of
American English” in Hualde, J., and Cole, J. (eds.) Papers in laboratory phonology
9. Berlin: Mouton de Gruyter, pp. 547–576.
Atterer, M., and Ladd, D. R. (2004) “On the phonetics and phonology of ‘segmental
anchoring’ of F0: Evidence from German”, Journal of Phonetics, 32, pp. 177–197.
Boersma, P., and Weenink, D. (2008) Praat: Doing phonetics by computer (Version
5.0.25) [computer program]. Retrieved 31 May 2008 from
Britain, D. (2013) “Space, diffusion and mobility” in Chambers, J. K., and Schilling-
Estes, N. (eds.) Handbook of language variation and change, 2nd edition. Hoboken,
NJ: Wiley-Blackwell, pp. 471–500.
Daan, J. (1938) “Dialect and pitch pattern of the sentence” in Blancquaert, E., and
Pée, W. (eds.) Proceedings of the Third International Congress of Phonetic Sciences.
Ghent: Laboratory of the Phonetics of the University, pp. 473–480.
Del Giudice, A., Shosted, R., Davidson, K., Salihie, M., and Arvaniti, A.
(2007) “Comparing methods for locating pitch ‘elbows’” in Trouvain, J. (ed.)
Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS).
Saarbrücken: Universität des Saarlandes, pp. 1117–1120.
Gilles, P. (2005) Regionale Prosodie im Deutschen: Variabilität in der Intonation von
Abschluss und Weiterweisung. Berlin: Walter de Gruyter.
Grabe, E. (1998) “Pitch accent realization in English and German”, Journal of
Phonetics, 26, pp. 129–144.
Grabe, E., Post, B., Nolan, F., and Farrar, K. (2000) “Pitch accent realization in four
varieties of British English”, Journal of Phonetics, 28, pp. 161–185.
Gussenhoven, C. (2016) “Foundations of intonational meaning: Anatomical and
physiological factors”, Topics in Cognitive Science, 8, pp. 425–434.
Gussenhoven, C., and Rietveld, A. C. M. (1992) “Intonation contours, prosodic
structure, and preboundary lengthening”, Journal of Phonetics, 20, pp. 283–303.
Gussenhoven, C., and van der Vliet, P. (1999) “The phonology of tone and intonation
in the Dutch dialect of Venlo”, Journal of Linguistics, 35, pp. 99–135.
Hanssen, J. (2017) Regional variation in the realization of intonation contours in the
Netherlands. Utrecht: LOT Publications.
Hanssen, J., Peters, J., and Gussenhoven, C. (2016) “Phonetic effects of focus in five
varieties of Dutch” in Barnes, J., Brugos, A., and Shattuck-Hufnagel, S. (eds.)
Speech Prosody 2016. Boston: International Speech Communication Association
(ISCA), pp. 736–740.
Hirst, D. J. (2005) “Form and function in the representation of speech prosody”,
Speech Communication, 46, pp. 334–347.
Kalaldeh, R., Dorn, A., and Ní Chasaide, A. (2009) “Tonal alignment in three var-
ieties of Hiberno-English” in International Speech Communication Association
(ed.) Proceedings of Interspeech 2009. Brighton: ISCA, pp. 2443–2446.
Kügler, F. (2007) The intonational phonology of Swabian and Upper Saxon.
Tübingen: Max Niemeyer Verlag.
5.1 Introduction
Conventional linguistic wisdom has it that languages are tonal, stress-accented,
or pitch-accented, although Hyman (2006, 2009) has argued rather convin-
cingly that pitch accent is not a type,2 but a language may “pick and choose
properties from the tone and stress prototypes” (Hyman 2009: abstract).
Common definitions of tone language are as given in (1), so that presumably,
something that is not a tone language is a stress language and would have
properties laid out in (2).
The prima facie differences in the definitions in (1) and (2) are in fact not
straightforward when checked against the reality of languages. Starting with
(1a), the definition conjures impressions of minimal pairs involving tone, but
that depends on the interpretation of “meaning”. There are instances where
tone is used to signal different syntactic categorization of what might have
roughly the same semantic core, for example, to clothe and clothes in Standard
Cantonese would both be [ji] but mid tone for the verb and high tone for the
b Central Honshu
i Requires stipulation if initial part of word is H or L
ii Proto-Japanese is more tonal if this is taken in comparison with
Standard Japanese.
c Kagoshima (Kyushu)
Word level tone, like Mende
Honshu Kagoshima
(tone on initial and end of word) (tone at word level)
Also, citing Kikuyu (like Chinese), Tonga (like Honshu), and Ganda (like Std
Japanese), McCawley (1978) suggests the possibility that tone can evolve into
pitch accent, as evidenced by Bantu and Japanese.
As it turns out, there is evidence that stress can evolve into tone too. Wee and
Cheung (2015) report on Hong Kong English as one such case. They studied
the transliterations in a nineteenth-century Cantonese-English bilingual
dictionary and found that stressed English syllables typically
received a transliteration that had a higher (not necessarily high) tone than the
neighboring syllables. In modern Hong Kong English (which can
be historically traced to the early Cantonese-English contact documented by
these early Cantonese-English dictionaries), these syllables now receive a stable
high tone (not just a higher one). Hong Kong English now has minimal tone
pairs such as cán ‘metal container’ and càn ‘canteen’. Juxtaposing these facts,
we are confronted with the possibility that the tonality of Hong Kong English
could have evolved out of a stress language. A mitigating admission must
be made, however, that the Anglo-Canto contact did include a tone language
like Cantonese, so it may be that tone did not arise from a stress language but
in one. Nonetheless, Kingston did argue that tones in Scandinavian languages
(e.g., the distinction between Accent 1 and Accent 2 in Swedish, Danish, and
Norwegian) and in Central and Low Franconian dialects have evolved from
stress. Tonogenesis as being triggered by prosodic prominence is not unique to
Hong Kong English, and corroboration can be found in other languages.
Thus, from a diachronic perspective, there is also historical phonological
evidence that suggests the very intimate relationship between tone and stress.
5.2.4 Prosody
Stress is normally associated with prosody and may be assigned to feet. Most
conceptions regarding tones are that tones are associated with Tone Bearing
Units (TBUs), and current wisdom favors the mora, as evidenced in Thai (Morén
and Zsiga 2006) and Chinese (Wang 2002). However, it might well be that tones
are really associated with the foot, as can be seen in Kera (Pearce 2013: 141).
              (ws)s                        (s)(ws)
/L/    LLL    (dɨ̀mɨ̀ɨ̀)mɨ̀ ‘clothes’         (bɔ̀m)(bòrɔ̀ŋ) ‘carp’
/H/    HHH    (kǝ́kám)ná ‘chiefs’           (kúŋ)(kúrúŋ) ‘skin
/M/    MMM    (celɛɛ)rɛ ‘commerce’         (kaŋ)(kǝlaŋ) ‘hat’
/LH/   LLH    (gǝ̀dàà)mɔ́ ‘horse’            *
       LHH    *                            (dàk)(tǝ́láw) ‘type of
/HL/   HHL    (kǝ́sáá)bɔ̀ ‘cricket’          *
       HLL    *                            (mán)(dǝ̀hàŋ) ‘bag’
/MH/   MMH    (tɨlɨŋ)kɨ́ ‘hole’             *
       MHH    *                            (taa)(mǝ́káá) ‘sheep’
/HM/   HHM    (kúɓúr)si ‘burning coal’     *
       HMM    *                            (sáá)(tǝraw) ‘cat’
(9) “Tone” is where pitch is used consistently for indicating some kind of
prosodic contrast (lexical or postlexical) under normal phonation.
The set of prosodic contrasts would include what is a prosodic head and
what are the dependents. For example, the H-accented syllable in Japanese is
presumably the prosodic head that then spreads its tone to the dependents.
At the word level, such prosodic contrasts would take the form of different
phonological words, for example, má ‘hemp’ and mà ‘to scold’ in Standard
Chinese.10 Tone could also be used to distinguish one construction from
another, for example, lexicalized form in Cantonese wong4 ‘yellow’ versus
wong2 ‘egg yolk’, and to indicate intonation types such as interrogatives and
[pitch] and [contour] are parameters for how a language may express
its prosodic patterns. The definition of [pitch] allows us to cover cases of
whispering because in whispering, the intensity and length cues are intended
for the reconstruction of the F0 that is suppressed. Thus, regardless of how
prosodic contrasts are expressed in a language, phonetic Length, Intensity,
or Pitch (LIP) would all be utilized for expression of the language’s prosodic
“essence”. If modulation of LIP is involved, then that language would also be
specified for [contour].
The Prosodic Essence Conjecture in (11) promises a way of typologizing
languages in terms of how prosody might be expressed. Through the recognition
of [pitch] but without confining it to the word level, (11) defuses the
intractable problem of defining tone. To capture the functional differences
of “contrastive” and “distinctive”, the conjecture draws upon [contour],
since a language with it shall then be able to express prosodic essences
(12) Typology
[contour] Yes No
5.5 Conclusion
The stress-tone distinction of languages is problematic when one looks into
the phonetic properties and the associated phonological patterns of stress and
tone. The two appear to be so closely related as a prosodic cue that they might
conceivably be faces of the same prosodic coin. This chapter takes the
counterintuitive approach of collapsing both tone and stress, envisioning a
two-parameter system, [pitch] and [contour], to reimagine our understanding
of prosody. Underlying this conjecture is, first, the recognition that what
we have understood as tone relies on pitch as a primary physical property,
a constraint not applicable to stress; and, secondly, that tones may contour.
The pairing of these parameters surprisingly yields a four-way typology that
appears to be supported by actual known languages, offering a fresh perspec-
tive on how their prosodies might be understood. I hasten to add that the
1 Thanks to Winnie H. Y. Cheung, Mingxing Li, and Diana Archangeli for useful
ideas and discussions. If they thought I was insane, they hid their feelings well and
tried to inoculate me with many helpful challenging questions, all of which I only
managed to fudge. Special thanks to the audience at the International Conference
of Prosodic Studies, in particular Hongming Zhang, Jianhua Hu, Shengli Feng,
Jie Zhang, and Chilin Shih, for their insights and encouragement. The research
is supported financially by GRF-HKBU250712, and in all other aspects by
friendships too numerous to list.
2 Pitch accent is not a prototype, or else we would have things like glottal accent and
so forth (Hyman 2006).
3 But see also Hyman (2007) for a different account, and Wee (2015), who argues in
support of the OCP account.
4 Duanmu added non-head stress for compounds and phrases (p. 136). In the 2nd
edition published in 2007, Duanmu takes a more nuanced position, which still
highlights the elusiveness of stress in Standard Chinese.
5 See Wang (2002) for analysis of such neutral tones and their surface pitch values
of Beijing, Shanghai, and Urumqi, all evoking the mora.
6 Duanmu (2007) presents a two-accent model for describing modern Japanese as
being superior to McCawley’s (1978) one-accent model. The difference does not
affect the point made here.
7 Nonetheless, it should be noted that if the word has only one foot (i.e., disyllabic),
then each syllable can have a different tone, generating LH, HL, MH, and HM
sequences (Pearce 2013: 135–136).
8 See also Clements, Michaud, and Patin (2010), Hyman (2010), and Odden
(2010). These works examine how tone features are fundamentally different from
other phonological features and may not even be easily reduced to their physical
9 The syntagmatic issue is certainly quite complex, since there are languages like
Thai, which does have words that are analyzed as underlyingly toneless (Morén
and Zsiga 2006 believe that mid-tone syllables are underlyingly toneless), but these
are invariably heavy syllables, which I believe goes precisely to show the entangled
relationship between tone and prosody.
10 Phonological words are not necessarily real words, because in Standard Chinese there
are gaps involving attested syllables without all four tonal contrasts in the list of
real words, e.g., zhuo3, nu1, la2, gui2, etc.
11 Though of course pitch can be distorted by speech rate and other factors.
12 For an excellent brief, see Wikipedia’s entry on Hawaiian Phonology, accessed 4
August 2015,
Banti, G. (1988) “Two Cushitic systems: Somali and Oromo nouns” in van der
Hulst, H., and Smith, N. (eds.) Autosegmental studies in pitch accent systems.
Dordrecht: Foris, pp. 11–49.
Bradshaw, M. (1998) “One-step tone raising in Ali”, OSU Working Papers in
Linguistics, 51, pp. 1–17.
Chao, Y.-R. (1968) A grammar of spoken Chinese. Berkeley: University of
California Press.
Cheung, W. H. Y. (2009) “Span of high tones in Hong Kong English” in Kwon, I.,
Pritchett, H., and Spence, J., (eds.) Proceedings of the 35th Annual Meeting of
Berkeley Linguistics Society (BLS 35), Berkeley, CA, pp. 72–82.
Clements, G. N., Michaud, A., and Patin, C. (2010) “Do we need tone features?”
in Goldsmith, J. A., Hume, E., and Wetzels, L. (eds.) (2010) Tones and
features: Phonetic and phonological perspectives. Berlin: Mouton De Gruyter,
pp. 3–24.
de Lacy, P. (2002) “The interaction of tone and stress in optimality theory”, Phonology,
19(1), pp. 1–32.
Duanmu, S. (2000/2007) The phonology of Standard Chinese. Oxford: Oxford
University Press.
Duanmu, S. (2002) “Tone and non-tone languages”. Paper presented at the Eighth
International Symposium on Chinese Languages and Chinese Linguistics, 8–10
November 2002, Academia Sinica, Taipei.
Duanmu, S. (2007) “A two-accent model of Japanese word prosody”, Toronto Working
Papers in Linguistics, 28, pp. 29–48.
Feng, S.-L. (2002) The prosodic syntax of Chinese. Munich: Lincom Europa.
Fu, Q.-J., and Zeng, F.-G. (2000) “Identification of temporal envelope cues in Chinese
tone recognition”, Asia Pacific Journal of Speech, Language and Hearing, 5(1),
pp. 45–57.
Gao, M. (2002) Tones in whispered Chinese: Articulatory features and perceptual
cues. M.A. thesis, University of Victoria.
Gao, M.-K., and Shi, A.-S. (1963) Yuyanxue Gailun [Introduction to linguistics].
Beijing: Zhonghua Shuju.
Goedemans, R., and van der Hulst, H. (2013) “Weight-sensitive stress” in Dryer,
M. S., and Haspelmath, M. (eds.) The world atlas of language structures online.
Leipzig: Max Planck Institute for Evolutionary Anthropology. Available online at, accessed 17 August 2015.
Goldsmith, J. A., Hume, E., and Wetzels, L. (eds.) (2010) Tones and features: Phonetic
and phonological perspectives. Berlin: Mouton De Gruyter.
Hayes, B. (1995) The metrical theory of stress: Principles and case studies.
Chicago: University of Chicago Press.
Phonological representations
based on statistical modeling
in tonal languages
Si Chen
6.1 Introduction
Table 6.1 Eight tones in Chongming Chinese
onset consonants are not present. Also, Chen and Zhang (1997) state that
there is a glottal stop marked by the symbol “ʔ” in Tone 7 and Tone 8, which
reflects the Middle Chinese stop endings /p/, /t/, /k/. However, a phonetic
examination shows that for every speaker, this glottal stop only occurs within
a certain proportion of speech, and the proportion varies from speaker to
speaker, as reported in Section 6.2. Moreover, Zhang (2009) also reports that
Tones 7 and 8 are short tones, and the duration of each tone is also examined
with statistical analysis in Section 6.2. The phonetic cues examined were then
investigated in a perceptual experiment.
Tone 4, but creaky voice helps in identification of Tone 3 only, and not of
Tone 4.
In conclusion, in addition to F0 contours, other properties may contribute
to contrasting tones, including duration, intensity, and phonation types. Some
phonetic cues may play an important role besides F0 contours, as attested in
many tone languages.
This study argues that statistical modeling can improve the transformation
method, and the method proposed here has the following advantages: 1)
In providing a phonological representation, we need to determine whether to
represent a tone as a straight tone or a circumflex tone. This can be
statistically tested using the model selection procedure specified in Section
6.4. 2) We can only obtain a sample of the speech community and
of the infinite utterances of each individual speaker. Therefore, we need
powerful statistical tools in order to model and predict the tonal contours
of the whole speech community. Instead of using averaged F0 contours, it
is better to choose the optimal statistical model and obtain the fitted values
based on the chosen model before we transform the pitch values to Chao’s
letters. 3) This method is based on underlying pitch targets (Xu and Wang
2001; Prom-on et al. 2009; Chen et al. 2017), instead of directly modeling
the surface contours using regression models (Andruski and Costello 2004);
the target-based approach is argued to conform to articulatory mechanisms
(Xu and Prom-on 2014). 4) This study uses log z-score normalization to take
the perceptual aspect into consideration (Nolan 2003; Fujisaki 2004 as cited
in Prom-on et al. 2009), though future studies are needed to compare it with
a semitone transformation (Rose 2014). 5) The transformation takes a single
fitted value for the adjusted onset, turning point, or offset and compares it
with the 20 percent, 40 percent, 60 percent, and 80 percent sample quantiles
calculated from all the fitted values. 6) The number of quantiles can be
specified with respect to the requirement of transformation into Chao’s
five-level tone letters. The method is relatively flexible and
does not require the number of quantiles to be fixed, so it can adapt
to new tonal models proposed in the future that improve on Chao’s model
to better account for the challenges posed by Rose (2014) and Paterson
(2015). The details of the model-fitting procedure for this study are described
in Section 6.4.
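The quantile step in advantage 5 can be illustrated with a minimal Python sketch. It is not the chapter's own implementation: the function names, the example values, and the rule that a value exactly on a cut point falls into the lower band are all assumptions made for illustration.

```python
import statistics

def chao_cutpoints(fitted_values, levels=5):
    """Return the cut points that split the fitted values into `levels` bands.

    With levels=5 these are the 20/40/60/80 percent sample quantiles.
    """
    return statistics.quantiles(fitted_values, n=levels, method="inclusive")

def to_tone_letter(value, cutpoints):
    """Map one fitted value to a level from 1 (lowest) to len(cutpoints) + 1."""
    # Assumption: a value equal to a cut point stays in the lower band.
    return 1 + sum(value > c for c in cutpoints)

def tone_value(points, cutpoints):
    """Concatenate the levels of onset, (turning point,) and offset."""
    return "".join(str(to_tone_letter(p, cutpoints)) for p in points)

# Hypothetical normalized fitted values pooled over all tones.
fitted = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
cuts = chao_cutpoints(fitted)

# A hypothetical falling tone: high onset, low offset.
falling = tone_value([2.8, -1.8], cuts)
print(cuts, falling)
```

Passing a different `levels` value changes the number of quantiles, which mirrors the flexibility described in advantage 6.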
The structure of the paper is as follows. Section 6.2 describes the method-
ology, including subjects recruited and materials used in the phonetic examin-
ation, and presents the phonetic results. Section 6.3 introduces the perceptual
experimental design based on the findings in Section 6.2, and presents the
results with respect to accuracy rate and reaction time. Section 6.4 discusses
the transformation method, including the procedure of statistical modeling
and the assignment of tone values. Section 6.5 offers some general discussion,
including future studies, and draws a conclusion.
[Figure: proportion of glottalization for Tone 7 and Tone 8]
The ANOVA results for testing the duration of Tones 1–7 show that
they differ significantly in duration (F(6, 623) = 26.78, p < 0.001).
A similar analysis was made to test whether Tone 8 is shorter than other
tones. The ANOVA results, including Tones 1–6 and Tone 8, also show
that these tones differ significantly in duration (F(6, 623) = 35.29,
p < 0.001).
For post-hoc analysis, I used Dunnett’s test, treating Tone 7 (T7) and Tone
8 (T8) in turn as the control group in testing the differences in duration from
other tones. From Tables 6.4 and 6.5 as well as Figures 6.3 and 6.4, it can be
inferred that the durations of Tone 7 and Tone 8 are significantly different
from those of the other tones except for Tone 1. Thus, we can conclude that
Tones 7 and 8 are shorter than most other tones.
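As a rough illustration of where the reported degrees of freedom come from, the following Python sketch computes a one-way ANOVA F statistic by hand. The function name and all data are hypothetical. F(6, 623) implies seven groups and 630 observations in total; the equal split of 90 tokens per tone used below is an assumption, not a figure given in the text.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Tiny hypothetical example that can be checked by hand.
f_small, df1_small, df2_small = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])

# Seven hypothetical tone groups with 90 duration tokens each (assumed split).
tone_groups = [[250.0 + 20.0 * tone + (tok % 3) for tok in range(90)] for tone in range(7)]
_, df1, df2 = one_way_anova(tone_groups)
print(f_small, (df1, df2))
```

The sketch covers only the omnibus test; Dunnett's comparison of each tone against a control tone would need a dedicated statistics package.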
Figure 6.3 The plot for the result of the Dunnett’s test (control group: Tone 7)
Tone Duration
Tone 1 249.01
Tone 2 328.84
Tone 3 380.63
Tone 4 294.20
Tone 5 370.80
Tone 6 370.60
Tone 7 248.43
Tone 8 228.06
Figure 6.4 The plot for the result of the Dunnett’s test (control group: Tone 8)
in a trial are the same or different (Gerrits and Schouten 2004). Usually,
the number of same trials should match the number of different trials. The
advantages are that the design is simple so that the differences and similarities
do not need to be described to the participants, and they do not need to know
a particular label (McGuire 2010). Also, the reaction time is reliable and easy
to measure because participants make decisions based on the second stimulus.
The disadvantages are that when the task is difficult, subjects tend to respond
“same” more (McGuire 2010).
Identification tasks usually require the participants to give specific labels
for presented sounds. One type of this kind of identification task is the yes-
no task, where subjects are asked whether a certain stimulus is present or
whether it was x or y. It has the advantage of simplicity, but it has the dis-
advantage that there are no “direct comparisons of stimuli in each trial”. In
forced-choice identification tasks, the subject has to provide a label to the
stimulus presented. For example, the subject can be asked to label which
allotone they have heard in a tonal language. This design is also simple,
but when the response set is big, the analyses become difficult. Also, this
forced choice identification task forces subjects to make categorical decisions
(McGuire 2010).
Since my goal in the perceptual experiment is to test whether there are
other contributing cues in addition to F0 contours, it is enough to use the
AX task, which is simple to implement. Reaction time is also used as one of
the response variables, and the AX task can provide a reliable measurement
of it, as mentioned above. Moreover, some direct comparisons are needed in
each trial, which the identification tasks cannot provide. Subjects may also be
confused about the labeling of allotones, for which they have had no training,
so labeling would require more training, which introduces the likelihood of
error. The next section illustrates some concerns about ISI conditions before
a description of the experiment setup.
as a default choice for phonemic processing. Details about the stimuli and
participants are illustrated in the next section.
T1–T2   T2–T3   T3–T4   T4–T5   T5–T6   T6–T7   T7–T8
T1–T4   T2–T5   T3–T6   T4–T7   T5–T8
T1–T6   T2–T7   T3–T8
T1–T8
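Table 6.7 An example of the fractional factorial design with three variables
Trial type   Pair       Duration (A)   Glottalization (B)   Onset consonant (C)
Same         1a1a       +              -                    -
Same         1b1b       -              +                    -
Same         1c1c       -              -                    +
Same         1abc1abc   +              +                    +
Different    1a2a       +              -                    -
Different    1b2b       -              +                    -
Different    1c2c       -              -                    +
Different    1abc2abc   +              +                    +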
An illustration of the signs “+” and “-” for each variable is listed as follows.
Duration (A):
“+” means that the specific mean duration of a certain allotone calculated
from the production data was used.
“-” means that the overall grand mean of all allotones was used as a
normalized duration value.
Glottalization (B):
“+” means that a glottal stop appeared at the end of the vowel.
“-” means there was no glottal stop or irregular pulses.
Onset Consonant (C):
“+” means the onset consonant of the allotone was present.
“-” means the onset consonant was truncated, and the vowel starts at
the first zero-crossing point in the first cycle. The amplitude con-
tour was adjusted to be a gradual slope from intensity value zero to
the original value at the 25th millisecond to avoid abruptness of the
A total of 128 trials were created, with an equal number of same and
different trials. The different trials consist of 16 pairs and four types of
“+” and “-” combinations, as listed in Table 6.7. In this perceptual study,
16 listeners were recruited, eight females and eight males, who
were around 45 years old. The participants have lived in Qidong city for
most of their lives, with minimal contact with other languages and dialects.
No participants reported a history of speaking, hearing, or language
difficulty, and they received financial compensation for their participation. The
University of Florida Institutional Review Board approved the experimental
procedures. A Shure SM2 headset was used for the perceptual experiment.
The modified stimuli were presented to the participants using the software
E-Prime, and participants completed a training session with feedback on whether
each response was correct before the real trials started. In the training session,
participants familiarized themselves with the allotones and the
corresponding plotted graphs, and learned to press the button standing for “same”
when they believed the same allotones occurred in one trial, and “different”
if they believed different allotones were presented.
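The trial counts given above can be checked with a short Python sketch. The rule used to generate the 16 pairs (tones whose numbers differ by an odd step) is only an observation about the pairs listed before Table 6.7, not a design rule stated in the text, and the variable names are hypothetical.

```python
from itertools import product

tones = range(1, 9)  # Tones 1 to 8

# Assumption: the listed pairs are exactly those whose tone numbers
# differ by 1, 3, 5, or 7 (T1-T2, T1-T4, T1-T6, T1-T8, T2-T3, ...).
pairs = [(i, j) for i in tones for j in tones if j > i and (j - i) % 2 == 1]

# The four "+"/"-" combinations of Table 6.7.
conditions = ["a", "b", "c", "abc"]

different_trials = list(product(pairs, conditions))
n_same_trials = len(different_trials)  # same trials match the different ones in number
total_trials = len(different_trials) + n_same_trials
print(len(pairs), len(different_trials), total_trials)
```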
Two kinds of response values were collected from the participants. The
first is the A’ value, which is calculated from correct and incorrect
responses to the stimuli. Same and different trials each make up 50 percent
of the data to achieve a balance. The A’ value is set to 0.5 when the hit rate
(H) is the same as the false alarm rate (FA). It is calculated for the other
conditions as follows (Snodgrass et al. 1985: 451):
1. H>FA, A’=0.5+(H-FA)(1+H-FA)/(4H(1-FA))
2. H<FA, A’=0.5-(FA-H)(1+FA-H)/(4FA(1-H))
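The two cases above, together with the H = FA case, translate directly into a small Python function. This is a sketch for illustration only; the function name and the example rates are hypothetical.

```python
def a_prime(hit_rate, false_alarm_rate):
    """Nonparametric sensitivity A' from a hit rate H and a false alarm rate FA."""
    h, fa = hit_rate, false_alarm_rate
    if h == fa:
        return 0.5  # chance-level discrimination
    if h > fa:
        return 0.5 + (h - fa) * (1 + h - fa) / (4 * h * (1 - fa))
    return 0.5 - (fa - h) * (1 + fa - h) / (4 * fa * (1 - h))

# Hypothetical listener: 80 percent hits, 20 percent false alarms.
example = a_prime(0.8, 0.2)
print(example)
```

Values above 0.5 indicate that "different" trials are detected more often than they are falsely reported, and a value of 1 corresponds to perfect discrimination.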
The A’ values are calculated for each set of a, b, c, and abc, where each
set has different “+” and “-” values for the three variables to be tested. The
A’ values for all the sets a, b, c, and abc are averaged in order to test whether
there are main effects of duration, glottalization, and onset consonants.
A similar procedure is conducted on the second response, namely reaction
time (RT). Average reaction time is also used to test main effects of duration,
glottalization, and onset consonants. In order to examine whether these cues
contribute to certain allotone pairs, t-tests are conducted, as illustrated in the
next section.
a + - - 1281.27
b - + - 1303.75
c - - + 1295.31
abc + + + 1294.93
[Figure: effects plots for factors A (Duration), B (Glottalization), and C (Onset Consonant); axis: Absolute Effect; legend: Not Significant; Lenth’s PSE = 33.144]
a + - - 26.72
b - + - 25.91
c - - + 26.53
abc + + + 26.41
[Figures: effects plots for factors A (Duration), B (Glottalization), and C (Onset Consonant); axis: Absolute Effect; legend: Not Significant; Lenth’s PSE = 0.515625]
Table 6.13 A t-test for accuracy rate and reaction time concerning glottalization
(Allotone 8)
Table 6.14 A t-test for accuracy rate and reaction time concerning duration (Allotone 7)
Condition                           Accuracy rate   t-test                    Reaction time     t-test
With normalized duration            1.36  0.097     t(190) = 0.19, p = 0.85   1603.97  965.85   t(162) = 1.38, p = 0.17
(T1bT8b, T3bT8b, T1cT8c, T3cT8c,
Without normalized                  1.36  0.095                               1441.76  621.23
(T1aT8a, T3aT8a, T1abcT8abc,
$$T(t) = at + b$$
$$y(t) = \beta e^{-\lambda t} + at + b$$
where T(•) represents the underlying target, and y(•) represents F0 values
on the surface. When t = 0, the coefficient β is the distance between F0 con-
tour and the underlying pitch target. The parameter λ represents the rate
of approaching the target. Wong (2006) uses a similar model to predict the
underlying pitch targets for Cantonese tones.
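The role of β and λ can be made concrete with a small numerical sketch of the exponential model, in which the surface F0 is the linear target plus an exponentially decaying deviation. The parameter values below are hypothetical and chosen only for illustration.

```python
import math


def target(t, a, b):
    """Underlying linear pitch target T(t) = a*t + b."""
    return a * t + b


def surface_f0(t, beta, lam, a, b):
    """Surface F0: y(t) = beta * exp(-lam * t) + a*t + b."""
    return beta * math.exp(-lam * t) + target(t, a, b)


# Hypothetical parameters, for illustration only.
beta, lam, a, b = 2.0, 0.5, -0.1, 5.0

# Distance between the surface contour and the target at two time points.
gap_start = surface_f0(0.0, beta, lam, a, b) - target(0.0, a, b)
gap_later = surface_f0(10.0, beta, lam, a, b) - target(10.0, a, b)
```

At t = 0 the gap equals β, and it shrinks by a factor of exp(-λt) afterwards, so a larger λ means the contour approaches the target faster.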
Prom-On et al. (2009) propose a third-order critically damped system,
which constrains the variable control parameters. The model has the form
$$x(t) = mt + b$$
$$f_0(t) = (c_1 + c_2 t + c_3 t^2) e^{-\lambda t} + x(t)$$
where f0(t) is the F0 response, the underlying pitch target is x(t),
and λ represents the rate of approaching the target. The three parameters
c1, c2, and c3 are determined by the initial F0 value, initial velocity, and
initial acceleration.
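How the initial state fixes c1, c2, and c3 can be sketched by differentiating the model at t = 0. The expressions below are derived here under the assumption that the target enters additively as x(t) = mt + b; the function names and the numerical values are hypothetical.

```python
import math


def qta_coefficients(f0_0, v0, acc0, m, b, lam):
    """Solve for c1, c2, c3 from the initial F0 value, velocity, and acceleration.

    Obtained by evaluating f0(0), f0'(0), and f0''(0) for
    f0(t) = (c1 + c2*t + c3*t**2) * exp(-lam*t) + m*t + b.
    """
    c1 = f0_0 - b
    c2 = v0 + c1 * lam - m
    c3 = (acc0 + 2 * c2 * lam - c1 * lam ** 2) / 2
    return c1, c2, c3


def qta_f0(t, c1, c2, c3, lam, m, b):
    """Evaluate the third-order critically damped response at time t."""
    return (c1 + c2 * t + c3 * t * t) * math.exp(-lam * t) + m * t + b


# Hypothetical example: start at 100 with zero velocity and acceleration,
# approaching a static target of 110.
params = dict(m=0.0, b=110.0, lam=20.0)
c1, c2, c3 = qta_coefficients(100.0, 0.0, 0.0, **params)
```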
The procedure for selecting the models and testing for the polynomial
degree follows Chen et al. (2017). First, four models were fitted using non-linear
regression. The four models are the following:
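$$y(t) = \beta e^{-\lambda t} + at + b \qquad \text{(sim\_1)}$$
$$y(t) = (c_1 + c_2 t + c_3 t^2) e^{-\lambda t} + at + b \qquad \text{(com\_1)}$$
$$y(t) = \beta e^{-\lambda t} + dt^2 + at + b \qquad \text{(sim\_2)}$$
$$y(t) = (c_1 + c_2 t + c_3 t^2) e^{-\lambda t} + dt^2 + at + b \qquad \text{(com\_2)}$$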
To find plausible initial values, the following steps are taken:
1 I plot the function in “sim_1” and change the parameters until the
shape is similar to the curve connecting the mean values of y, which gives
the initial values.
2 To fit “com_1”, I first use the estimated values for λ, a, and b obtained by
fitting “sim_1”. Then I use the estimated value of β obtained by fitting
“sim_1”, and add a white noise term to regress β onto t and t² to obtain
an initial estimate for c1, c2, and c3.
3 To fit “sim_2”, I first use the λ value estimated by fitting “sim_1”,
create a covariate called et that equals exp(-λt), and regress y onto et, t,
and t² to obtain the values for β, d, a, and b.
4 To fit “com_2”, I first use the estimated values for λ, d, a, and b obtained
by fitting “sim_2”. Then I use the estimated value of β obtained by fitting
“sim_2”, and add a white noise term to regress β onto t and t² to obtain
an estimate for c1, c2, and c3.
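Step 3 is an ordinary linear regression once λ is held fixed, which the following sketch illustrates. It is not the R code used in the study: it solves the least-squares normal equations in plain Python, the helper names are hypothetical, and the data are synthetic and noise-free. The white-noise steps (2 and 4) are not covered.

```python
import math


def solve_linear(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def sim2_initial_values(times, y, lam):
    """Regress y on et = exp(-lam*t), t**2, t, and an intercept.

    Returns initial values (beta, d, a, b) for the model
    y(t) = beta*exp(-lam*t) + d*t**2 + a*t + b, with lam held fixed.
    """
    X = [[math.exp(-lam * t), t * t, t, 1.0] for t in times]
    k = 4
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return tuple(solve_linear(XtX, Xty))


# Synthetic, noise-free contour over 20 time points (hypothetical parameters).
times = [float(t) for t in range(1, 21)]
true_beta, true_d, true_a, true_b, lam = 2.0, 0.01, -0.2, 5.0, 0.3
y = [true_beta * math.exp(-lam * t) + true_d * t * t + true_a * t + true_b for t in times]
beta0, d0, a0, b0 = sim2_initial_values(times, y, lam)
```

The values returned here would then serve only as starting points for the non-linear fit, in which λ is re-estimated together with the other coefficients.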
Figure 6.9 Tone 1: comparison of the fitted and mean Tone 1 contours (x-axis: time points; y-axis: log-normalized frequency)
Figure 6.10 Tone 2: comparison of the fitted and mean Tone 2 contours (x-axis: time points; y-axis: log-normalized frequency)
Figure 6.11 Tone 3: comparison of the fitted and mean Tone 3 contours (x-axis: time points; y-axis: log-normalized frequency)
Figure 6.12 Tone 4: comparison of the fitted and mean Tone 4 contours (x-axis: time points; y-axis: log-normalized frequency)
Figure 6.13 Tone 5: comparison of the fitted and mean Tone 5 contours (x-axis: time points; y-axis: log-normalized frequency)
Figure 6.14 Tone 6: comparison of the fitted and mean Tone 6 contours (x-axis: time points; y-axis: log-normalized frequency)
Figure 6.15 Tone 7: comparison of the fitted and mean Tone 7 contours (x-axis: time points; y-axis: log-normalized frequency)
Figure 6.16 Tone 8: comparison of the fitted and mean Tone 8 contours (x-axis: time points; y-axis: log-normalized frequency)
Based on the criteria in Table 6.17 and the fitted values for the onset and
offset of tones, the transformation can be done accordingly. Since Tones 3, 4,
5, and 6 have underlying targets of a polynomial degree of two, the turning
points of these tones were also transformed. The turning points are found by
setting the derivative of the function in the chosen model for each tone to
zero and solving the resulting non-linear equation with the R package “nleqslv”.
After the positions of the turning points are calculated, the values are plugged
into the original model to obtain the fitted values for the turning point.
For example, in calculating the positions of the turning point for Tone 4,
the derivative of the function y(t) is calculated and set to zero to solve the
non-linear equation for t, which stands for the position of the turning point:
The solution is approximately t = 3.66, which is then plugged into the function
y(t) for the fitted turning point F0 value, and a further transformation is done
based on the criteria in Table 6.17. The final phonological representation is
demonstrated in Table 6.18.
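The turning-point calculation can be sketched as a one-dimensional root-finding problem on the derivative. The sketch below uses plain bisection in Python rather than the R package “nleqslv”, assumes a model with an exponential term and a quadratic target, and uses hypothetical coefficients that are not the fitted Tone 4 values.

```python
import math

# Hypothetical coefficients, for illustration only (not the fitted values).
BETA, LAM, D, A, B = 1.0, 0.5, 0.02, -0.3, 4.0


def y(t):
    """Model with quadratic target: y(t) = beta*exp(-lam*t) + d*t**2 + a*t + b."""
    return BETA * math.exp(-LAM * t) + D * t * t + A * t + B


def dy(t):
    """First derivative of y with respect to t."""
    return -BETA * LAM * math.exp(-LAM * t) + 2 * D * t + A


def find_turning_point(lo, hi, tol=1e-10):
    """Bisection on dy(t) = 0; requires a sign change of dy on [lo, hi]."""
    if dy(lo) * dy(hi) > 0:
        raise ValueError("derivative does not change sign on the interval")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if dy(lo) * dy(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


t_star = find_turning_point(0.0, 20.0)   # position of the turning point
f0_at_turn = y(t_star)                   # fitted value at the turning point
```

The fitted value at the turning point is then the quantity that would be compared against the quantile criteria.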
Tone values in parentheses are shown in Table 6.1.
Compared with the values in Table 6.1 from impressionistic data, the
basic contours are characterized similarly, but the specific values (one to
five) are assigned with some differences. Also, the tones whose underlying
targets tested as quadratic (Tones 3, 4, 5, and 6) are also described using
three of Chao’s letters in the fieldwork study, except for Tone 5, so the
statistical modeling of instrumental data and the purely impressionistic data
are largely consistent in their use of a turning point. Specifically, Tone 1
shows a falling slope in my representation (51), as plotted in Figure 6.9, in
contrast to the flatter slope (53) of previous representations. Tone 2 has a
similar contour to previous representations, with higher offset values. Tone 3
has a higher onset and a lower turning point and offset. Tone 4, Tone 5, and
Tone 6 have a higher onset, and Tone 7 has a steeper falling slope. Finally,
Tone 8 shows a higher onset and offset, but still a rising slope as in the
fieldwork description. The transformed values in this study conform well to
the figures of the tonal contours presented.
Table 6.16 Models chosen for each tone and estimated coefficients
Tone Model c1 c2 c3 β λ d a b
impressionistic data. It is hoped that this method can be refined and adapted
to new sets of data in order to assist in finding a plausible tonal representation
for new phonetic data.
This study first focuses on phonetic examinations of several phon-
etic cues subject to a perceptual experiment, and it proceeds to statistically
model Chongming Chinese tones to provide a phonological representa-
tion. Before illustrating how phonetic data is transformed into phonological
representations, a perceptual study is conducted to test whether other phon-
etic cues reported in production, such as duration and glottalization, con-
tribute to discrimination of allotone pairs after voiced versus voiceless onset
in general and for specific pairs, and whether F0 contours suffice to discrim-
inate allotone pairs without including onset consonants. The perceptual
experiment used a fractional factorial design to evaluate whether accuracy
rate and reaction time are affected by three variables: duration (normalized
or original), glottalization (with or without), and onset consonants (with or
without). The results show no statistically significant effects on
accuracy rate or reaction time for the allotones tested in general. Moreover,
although glottalization and short duration are reported in the fieldwork
records, and confirmed by phonetic examination in this study to accompany
Allotone 7 and Allotone 8 specifically, they do not seem to contribute signifi-
cantly in perceptually discriminating these allotones when F0 contours are
Based on the results of the perceptual experiment, F0 contours do play
an important role in the discrimination of the allotone pairs with respect
to voiced versus voiceless onset, and they may be the primary cue, sub-
ject to further experiments. This study then proceeds to statistically model
F0 values extracted, in order to obtain a phonological representation for
monotones. Based on previous research, four models were fitted to calcu-
late each underlying pitch target, and the optimal model with minimized
AIC was chosen. Whether the underlying target is quadratic or not can also
be statistically tested, and if the underlying target is quadratic, a turning
point is found mathematically and transformed based on the chosen model.
If the underlying target is not quadratic, the turning point is not included
in the phonological representation. Then, an adjusted onset point to alle-
viate perturbation effects and an offset point are calculated for further trans-
formation. The plots of generated fitted values based on the selected model
showed similar contours to the plots of averaged F0 values. Fitted values
were obtained from the optimal model, and four sample quantiles
(20 percent, 40 percent, 60 percent, and 80 percent) were calculated from all
the fitted F0 values. These quantiles are needed to transform the adjusted
onset, the turning point, and the offset to Chao’s five-point scale by comparing
each value with the quantiles. The tones that are statistically tested to be
quadratic correspond well to the fieldwork description, and the basic tonal
shapes are similar, with some differences in the exact integers for the onset,
turning point, and offset.
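Two of the steps summarized above, choosing the model with the smallest AIC and mapping a fitted value onto the five-point scale via the 20/40/60/80 percent quantiles, can be sketched as follows. The AIC expression is the usual least-squares form under Gaussian errors (up to an additive constant), the boundary handling at the quantiles is an assumption rather than the exact criteria of Table 6.17, and all numbers and names are hypothetical.

```python
import bisect
import math
import statistics


def aic_least_squares(rss, n_obs, n_params):
    """AIC for a least-squares fit, up to an additive constant."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params


def pick_model(fits, n_obs):
    """fits maps model name -> (residual sum of squares, number of parameters)."""
    return min(fits, key=lambda name: aic_least_squares(fits[name][0], n_obs, fits[name][1]))


def chao_cutoffs(fitted_values):
    """20th, 40th, 60th, and 80th percentiles of all fitted values."""
    return statistics.quantiles(fitted_values, n=5, method="inclusive")


def to_chao(value, cutoffs):
    """Map a fitted value to 1..5; values at or below a cutoff fall in the lower band."""
    return bisect.bisect_left(cutoffs, value) + 1


# Hypothetical fits: a small drop in RSS does not justify two extra parameters.
best = pick_model({"sim_1": (4.0, 4), "com_1": (3.9, 6)}, n_obs=20)

# Hypothetical fitted values, for illustration only.
cutoffs = chao_cutoffs([float(v) for v in range(11)])
onset_letter, turn_letter, offset_letter = (to_chao(v, cutoffs) for v in (10.0, 5.0, 0.0))
```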
For future studies, more perceptual experiments need to be designed,
testing allotone pairs with a difference in aspiration of onsets, and the iden-
tification tasks may also be used to detect whether other cues may contribute
more in identification tasks than discrimination tasks. More evaluations
need to be done on the effectiveness of Chao’s model, and the transform-
ation methods can be further developed according to the improved model.
The effectiveness of the transformation methods can also be evaluated by
perceptual experiments. The proposed method in this study is still prelim-
inary, and more data on tonal languages need to be collected to compare
the current methods with methods proposed in the literature for further
The help of language consultants in collecting Chongming Chinese data is
highly appreciated. Comments and suggestions concerning statistical mod-
eling and R code checking from professors in the Department of Statistics,
University of Florida, are gratefully acknowledged. This work is supported
by grant [1-ZVHH] from the Faculty of Humanities and grant [G-UAAG]
from the Department of Chinese and Bilingual Studies at the Hong Kong
Polytechnic University, and partly supported by Early Career Scheme [No.
T26023416] from the Research Grants Council of Hong Kong.
’t Hart, J., Collier, R., and Cohen, A. (1990) A perceptual study of intonation: An
experimental phonetic approach to speech melody. Cambridge: Cambridge
University Press.
Abramson, A. S. (1972) “Tonal experiments with whispered Thai” in Valdman, A.
(ed.) Papers on linguistics and phonetics to the memory of Pierre Delattre. The
Hague: Mouton, pp. 29–55.
Abramson, A. S. (1975) “The tones of Central Thai: Some perceptual experiments”
in Harris, J. G., and Chamberlain, J. R. (eds.) Studies in Thai linguistics in honor of
William J. Gedney. Bangkok: Central Institute of English Language, pp. 1–16.
Andruski, J. E. (2006) “Tone clarity in mixed pitch/phonation-type tones”, Journal of
Phonetics, 34(3), pp. 388–404.
Andruski, J. E., and Costello, J. (2004) “Using polynomial equations to model pitch
contour shape in lexical tones: An example from Green Mong”, Journal of the
International Phonetic Association, 34(2), pp. 125–140.
Bao, Z.-M. (1990) On the nature of tone. Ph.D. Diss., Massachusetts Institute of
Belotel-Grenie, A., and Grenie, M. (1997) “Phonation and tone types in Standard
Chinese”, Cahiers de Linguistique Asie Orientale, 26(2), pp. 249–279.
Blicher, D. L., Diehl, R. L., and Cohen, L. B. (1990) “Effects of syllable duration on
the perception of the Mandarin Tone 2/Tone 3 distinction: Evidence of auditory
enhancement”, Journal of Phonetics, 18(1), pp. 37–49.
Boersma, P., and Weenink, D. (2013) Praat: Doing phonetics by computer [Computer
program]. Version 5.3.52. Retrieved 12 June 2013,
Brunelle, M. (2003) “Tone coarticulation in Northern Vietnamese”, in Solé, M. J.,
Recasens, D., and Romero, J. (eds.) Proceedings of the 15th International Congress
of Phonetic Sciences. Barcelona: ICPhS Archive, pp. 2673–2676.
Brunelle, M. (2009) “Northern and Southern Vietnamese tone coarticulation: A
comparative case study”, Journal of the Southeast Asian Linguistics Society, 1,
pp. 49–62.
Burnham, D., and Francis, E. (1997) “The role of linguistic experience in the percep-
tion of Thai tones” in Abramson, A. S. (ed.) South East Asian linguistic studies
in honour of Vichin Panupong. Science of Language 8. Bangkok: Chulalongkorn
University Press, pp. 29–47.
Chang, L. M. (1992) A prosodic account of tone, stress and tone sandhi in Chinese
languages. Ph.D. Diss., University of Hawaii.
Chao, Y. R. (1930) “A system of tone letters”, Le Maître Phonétique, 45, pp. 24–27.
Charpentier, F., and Stella, M. (1986) “Diphone synthesis using an overlap-add tech-
nique for speech waveforms concatenation”, Proceedings of ICASSP, 86(3), pp.
Chen, M. (2000) Tone sandhi patterns across Chinese dialects. Cambridge: Cambridge
University Press.
Chen, M., and Zhang, H.-M. (1997) “Lexical and postlexical tone sandhi in
Chongming” in Wang, J.-L., and Smith, N. (eds.) Studies in Chinese phonology.
Berlin and New York: Mouton de Gruyter, pp. 13–52.
Chen, S., Zhang, C., McCollum, A. G., and Wayland, R. (2017) “Statistical modelling of
phonetic and phonologised perturbation effects in tonal and non-tonal languages”,
Speech Communication, 88, pp. 17–38. doi: 10.1016/j.specom.2017.01.006
Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York: Harper
and Row.
Clements, G., and Ford, K. C. (1979) “Kikuyu tone shift and its synchronic
consequences”, Linguistic Inquiry, 10(2), pp. 179–210.
Cohn, A. C. (1990) Phonetic and phonological rules of nasalization. Ph.D. Diss.,
University of California Los Angeles. Distributed as UCLA Working Papers in
Phonetics 76.
Cohn, A. C. (2007) “Phonetics in phonology and phonology in phonetics”, Working
Papers of the Cornell Phonetics Lab, 16, pp. 1–31.
Davison, D. S. (1991) “An acoustic study of so-called creaky voice in Tianjin
Mandarin”, University of California Working Papers in Phonetics, 78, pp. 50–57.
Duanmu, S. (1990) A formal study of syllable, tone, stress and domain in Chinese
languages. Ph.D. Diss., Massachusetts Institute of Technology.
Duanmu, S. (1994) “Against contour tone units”, Linguistic Inquiry, 25(4), pp.
Flemming, E. (2001) “Scalar and categorical phenomena in a unified model of
phonetics and phonology”, Phonology, 18, pp. 7–44.
Fujisaki, H. (2004) “Prosody, information, and modeling: With emphasis of tonal
features of speech”, in Bel, B., and Marlien, I. (eds.) Speech prosody. Nara: ISCA
Archive, pp. 1–10.
Fujisaki, H., and Kawashima, T. (1969) “On the modes and mechanisms of speech
perception”, Annual Report of the Engineering Research Institute, 28, pp. 67–73.
Fujisaki, H., Wang, C., Ohno, S., and Gu, W.-T. (2005) “Analysis and synthesis of fun-
damental frequency contours of Standard Chinese using the command-response
model”, Speech Communication, 47, pp. 59–70.
Gandour, J. (1978) “The perception of tone” in Fromkin, V. (ed.) Tone: A linguistic
survey. New York: Academic, pp. 26–37.
Gerrits, E., and Schouten, M. E. H. (2004) “Categorical perception depends on the
discrimination task”, Perception & Psychophysics, 66(3), pp. 363–376.
Goldsmith, J. A. (1976) Autosegmental phonology. Ph.D. Diss., Massachusetts
Institute of Technology.
Goldsmith, J. A. (1999) Phonological theory: The essential readings. Oxford: Blackwell.
Hombert, J. M. (1976) “Perception of tones of bisyllabic nouns in Yoruba”, Studies in
African Linguistics, Supplement 6, pp. 109–121.
15th International Congress of Phonetic Sciences. Barcelona: ICPhS Archive, pp.
Paterson, H. J. (2015) “Phonetic transcription of tone in the IPA” in The Scottish
Consortium for ICPhS 2015 (ed.) Proceedings of the 18th International Congress of
Phonetic Sciences. Glasgow, UK: University of Glasgow, pp. 507.1–5.
Pham, A. H. (2003) Vietnamese tone: A new analysis. Outstanding Studies in
Linguistics. London: Routledge.
Pierrehumbert, J. (1980) The phonology and phonetics of English intonation. Ph.D.
Diss., Massachusetts Institute of Technology.
Prince, A., and Smolensky, P. (1993) Optimality theory: Constraint interaction in
generative grammar. MS Thesis, Rutgers University, New Brunswick, NJ, and
University of Colorado, Boulder.
Prom-On, S., Xu, Y., and Thipakorn, B. (2009) “Modeling tone and intonation in
Mandarin and English as a process of target approximation”, Journal of the
Acoustical Society of America, 125(1), pp. 405–424.
Rose, P. (2014) “Transcribing tone – A likelihood-based quantitative evaluation of
Chao’s ‘tone letters’” in Li, H., Meng, H. M., Ma, B., Chng, E., and Xie, L. (eds.)
15th Annual Conference of the International Speech Communication Association
(INTER-SPEECH’14). Singapore: ISCA Archive, pp. 101–105.
Shi, F. (1990) Yuyinxue Tanwei [An exploration of phonetics]. Peking: Peking
University Press.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P.,
Pierrehumbert, J., and Hirschberg, J. (1992) “ToBI: A standard for labeling English
prosody” in Proceedings of ICSLP, Banff, pp. 867–870.
Snodgrass, J., Levy-Berger, G., and Haydon, M. (1985) Human experimental psych-
ology. New York: Oxford University Press.
Steriade, D. (2000) “Paradigm uniformity and the phonetics phonology boundary” in
Broe, M., and Pierrehumbert, J. (eds.) Papers in laboratory phonology V: Acquisition
and the lexicon. Cambridge: Cambridge University Press, pp. 313–334.
Sun, X.-J. (2001) “Predicting underlying pitch targets for intonation modeling” in
4th ISCA Tutorial and Research Workshop on Speech Synthesis. Perthshire: ISCA
archive, pp. 143–148.
Valbret, H., Moulines, E., and Tubach, J. (1991) “Voice transformation using PSOLA
technique”, Proceedings of Eurospeech, 91(1), pp. 345–348.
Wayland, R., and Guion, S. (2003) “Perceptual discrimination of Thai tones by native
and experienced learners of Thai”, Applied Psycholinguistics, 24(1), pp. 113–129.
Werker, J. F., and Logan, J. (1985) “Cross-language evidence for three factors in speech
perception”, Perception & Psychophysics, 37(1), pp. 35–44.
Whalen, D. H., and Xu, Y. (1992) “Information for Mandarin tones in the amplitude
contour and in brief segments”, Phonetica, 49(1), pp. 25–47.
Wong, Y. W. (2006) “Contextual tonal variations and pitch targets in Cantonese” in
Proceedings of speech prosody. Dresden, pp. 317–320.
Xu, Y. (2005) “Speech melody as articulatorily implemented communicative
functions”, Speech Communication, 46, pp. 220–251.
Xu, Y., and Prom-on, S. (2014) “Toward invariant functional representations of vari-
able surface fundamental frequency contours: Synthesizing speech melody via
model-based stochastic learning”, Speech Communication, 57, pp. 181–208.
Xu, Y., and Wang, Q. E. (2001) “Pitch targets and their realization: Evidence from
Mandarin Chinese”, Speech Communication, 33, pp. 319–337.
Prosodic encoding of contrastive
focus in Shanghai Chinese
Bijun Ling and Jie Liang
7.1 Introduction
It is well known that in speech communication, the same sentence is often
uttered differently depending on the communicative context and the speaker’s
intention. There are at least three ways to package an utterance in order to
integrate it into the information flow of ongoing discourse:
(1) using word order (i.e., given information generally precedes focused
information) (e.g., Birner 1994; Clark and Clark 1978); (2) using particular
lexical items and syntactic constructions (e.g., using cleft constructions
such as ‘It was Damon who fried an omelet’) (Lambrecht 2001); and
(3) using prosody. Prosody is comprised of acoustic features like fun-
damental frequency (f0), duration, and loudness, the combinations of
which give rise to psychological percepts like phrasing (grouping), stress
(prominence), and tonal movement (intonation).
(Breen et al. 2010: 1049)
There have been many studies on the prosodic encoding of different informa-
tion structure notions, especially focus in many languages and from different
Last but not least, there has been a long debate on the relationship between
focus encoding and prosodic structure. Different analyses have been proposed
and can be roughly divided into two sub-groups: (1) Indirect Encoding: The
prosodic encoding of focus is mediated by prosodic structure. In particular,
focus relocates prosodic prominence and in turn triggers inserting or deleting
prosodic boundaries, and the phonetic effects are the results of this modi-
fication (Pierrehumbert 1980; Gussenhoven 1983; Truckenbrodt 1995; Ladd
1996). In other words, the focal f0-rise is a result of the prosodic boundary
Morphosyntax: (Adj. N) NP (V N) VP
炒 饭 (fried rice) 炒 饭 (to fry rice)
/tshɔ33 vε44/ /tshɔ44 vε13/
Prosodic headedness: * *
Prosodic structure: (Adj. N)𝔀 ((V)𝔀 (N)𝔀)𝝋
These unique tonal features make SHC an interesting case for the study of
the prosodic encoding of contrastive focus and the relationship between focus
encoding and prosodic structure. In this chapter, we attempted to answer the
following questions:
1 What are the f0, duration, and intensity patterns of compound words and
VP phrases? Do they reflect the same or different prosodic structures?
2 How does contrastive focus affect the f0, duration, and intensity patterns
of compound words and VP phrases?
3 What is the relationship between focus encoding and the prosodic struc-
ture in SHC, direct or indirect?
7.2 Method
Sentence S1 S2 S3 S4 Meaning
Note: The numbers in the upper right corner indicate the tone type.
Stimulus sentence
7.3 Results
Figure 7.1 displays mean f0 contours of the four target syllables in the four
stimulus sentences, uttered in the non-focused condition. These f0 contours
were obtained by taking 10 f0 points (in Hz) at proportionally equal time
intervals between the acoustic onset and offset of the vowel in the target
syllables, and then these values were transformed into semitones and averaged
across speakers and repetitions.
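The extraction procedure just described can be sketched in three small steps: choosing ten proportionally spaced time points within the vowel, converting Hz to semitones, and averaging contours point by point. The 100 Hz reference, the inclusion of both vowel edges among the ten points, and all names below are assumptions made for illustration; the chapter does not state them.

```python
import math


def sample_times(onset, offset, n_points=10):
    """Equally spaced time points from vowel onset to offset (both included)."""
    step = (offset - onset) / (n_points - 1)
    return [onset + i * step for i in range(n_points)]


def hz_to_semitones(f0_hz, ref_hz=100.0):
    """Convert F0 in Hz to semitones relative to ref_hz (assumed reference)."""
    return 12.0 * math.log2(f0_hz / ref_hz)


def mean_contour(contours):
    """Average several equal-length contours point by point."""
    return [sum(vals) / len(vals) for vals in zip(*contours)]


# Hypothetical values, for illustration only.
times = sample_times(0.10, 0.30)
octave_up = hz_to_semitones(200.0)
averaged = mean_contour([[1.0, 2.0], [3.0, 4.0]])
```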
Figure 7.1 The time-normalized f0 contours of the four target syllables (S1–S4)
within the four sentence types, uttered in the non-focused condition (x-axis:
normalized time, points 1–10; y-axis: F0 in semitones). S1, S2, S3, and S4
stand for the first, second, third, and fourth syllable within the sentence.
Sentence Type 1 is /lɔ3tsɤ1 ma3gɔ1/; Type 2 is /lɔ3tsɤ1 sɔ1ve3/; Type 3 is
/tsɤ1lɔ3 ma3gɔ1/; and Type 4 is /tsɤ1lɔ3 sɔ1ve3/
Figure 7.2 The time-normalized f0 contours of the four stimulus sentences
(Types 1–4), with panels for S1+S2 and S3+S4, uttered in the non-focused
condition (N-F: red) and in the focused condition with contrastive focus on S1
(F-S1: dark green), on S2 (F-S2: green), on S3 (F-S3: blue), and on S4
(F-S4: purple) (x-axis: normalized time, points 1–10; y-axis: F0 in semitones)
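Table 7.5 The effects of contrastive focus on the maxf0 and minf0 of each syllable
Focus Condition S1 (T1[HH]/T3[LL]) S2 (T1[HH]/T3[HL]) S3 (T1[HL]/T3[LH]) S4 (T1[HL]/T3[LH])
Estimate Std.E t p Estimate Std.E t p Estimate Std.E t p Estimate Std.E t p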
F-S1 T1 maxf0 ↑ 5.083 0.360 14.102 0.000 ↑ 5.083 0.360 14.102 0.000 ↓ -4.577 0.521 -8.790 0.000 ↓ -7.458 0.460 -16.222 0.000
minf0 ↑ 4.660 0.353 13.188 0.000 ↑ 4.660 0.353 13.188 0.000 ↓ -3.518 0.392 -8.975 0.000 ↓ -4.350 0.502 -8.662 0.000
T3 maxf0 0.149 0.222 0.673 0.501 ↑ 3.328 0.416 8.002 0.000 ↓ -1.287 0.378 -3.401 0.001 ↓ -4.142 0.440 -9.418 0.000
minf0 ↓ -1.508 0.337 -4.478 0.000 ↓ -3.310 0.367 -9.021 0.000 ↓ -2.060 0.374 -5.506 0.000 ↓ -2.365 0.460 -5.140 0.000
F-S2 T1 maxf0 ↑ 2.879 0.360 7.986 0.000 ↑ 2.879 0.360 7.986 0.000 ↓ -4.555 0.521 -8.749 0.000 ↓ -6.799 0.460 -14.787 0.000
minf0 ↑ 2.562 0.353 7.250 0.000 ↑ 2.562 0.353 7.250 0.000 ↓ -3.952 0.392 -10.080 0.000 ↓ -4.018 0.502 -8.001 0.000
T3 maxf0 -0.185 0.222 -0.834 0.404 ↑ 2.156 0.416 5.183 0.000 ↓ -1.233 0.378 -3.259 0.001 ↓ -4.113 0.440 -9.354 0.000
minf0 ↓ -0.992 0.337 -2.946 0.003 ↓ -4.622 0.367 -12.596 0.000 ↓ -1.963 0.374 -5.247 0.000 ↓ -2.394 0.460 -5.203 0.000
F-S3 T1 maxf0 -0.011 0.360 -0.031 0.975 -0.011 0.360 -0.031 0.975 ↑ 4.588 0.521 8.811 0.000 ↑ 2.037 0.460 4.430 0.000
minf0 0.096 0.353 0.271 0.787 0.096 0.353 0.271 0.787 -0.096 0.392 -0.246 0.806 -0.177 0.502 -0.352 0.725
T3 maxf0 0.134 0.222 0.603 0.546 0.283 0.416 0.680 0.496 ↑ 2.251 0.378 5.950 0.000 ↓ -2.484 0.440 -5.648 0.000
minf0 -0.204 0.337 -0.606 0.544 0.283 0.367 0.771 0.441 -0.483 0.374 -1.291 0.197 ↓ -1.483 0.460 -3.224 0.001
F-S4 T1 maxf0 -0.245 0.360 -0.679 0.497 -0.245 0.360 -0.679 0.497 0.756 0.521 1.451 0.147 ↑ 4.063 0.460 8.837 0.000
minf0 0.068 0.353 0.191 0.848 0.068 0.353 0.191 0.848 0.199 0.392 0.506 0.613 0.560 0.502 1.116 0.265
T3 maxf0 0.213 0.222 0.960 0.337 0.254 0.416 0.611 0.541 ↑ 1.114 0.378 2.944 0.003 ↑ 4.423 0.440 10.059 0.000
minf0 0.356 0.337 1.056 0.291 0.389 0.367 1.059 0.290 1.121 0.374 2.996 0.003 0.503 0.460 1.094 0.274
Table 7.6 The effects of contrastive focus on the rhyme duration and mean intensity of each syllable
Duration S1 S2 S3 S4
F-S1 ↑ 0.331 0.026 12.518 0.000 ↑ 0.097 0.027 3.573 0.000 ↓ -0.051 0.022 -2.316 0.021 ↓ -0.178 0.027 -6.647 0.000
F-S2 ↑ 0.198 0.026 7.501 0.000 ↑ 0.355 0.027 13.139 0.000 0.020 0.022 0.897 0.370 ↓ -0.143 0.027 -5.342 0.000
F-S3 ↑ 0.066 0.026 2.517 0.012 0.053 0.027 1.963 0.050 ↑ 0.347 0.022 15.719 0.000 ↑ 0.028 0.027 1.033 0.302
F-S4 0.039 0.026 1.481 0.139 0.009 0.027 0.345 0.730 ↑ 0.065 0.022 2.958 0.003 ↑ 0.310 0.027 11.614 0.000
Intensity Estimate Std.E t p Estimate Std.E t p Estimate Std.E t p Estimate Std.E t p
F-S1 ↑ 0.031 0.006 5.512 0.000 ↑ 0.030 0.005 5.558 0.000 ↓ -0.042 0.007 -6.464 0.000 ↓ -0.080 0.007 -10.812 0.000
F-S2 ↑ 0.015 0.006 2.731 0.006 ↑ 0.036 0.005 6.803 0.000 ↓ -0.033 0.007 -5.066 0.000 ↓ -0.072 0.007 -9.645 0.000
F-S3 0.001 0.006 0.187 0.852 0.003 0.005 0.473 0.636 ↑ 0.024 0.007 3.634 0.000 -0.007 0.007 -0.893 0.372
F-S4 0.007 0.006 1.221 0.222 0.002 0.005 0.410 0.682 ↑ 0.016 0.007 2.511 0.012 ↑ 0.038 0.007 5.115 0.000
Figure 7.3 Box plots of the relative rhyme duration (left) and relative mean
intensity (right) of each target syllable (S1–S4). The middle line represents
the median, the box represents the interquartile range (1st to 3rd quartile),
and the whiskers represent maximally 1.5 times the interquartile range
7.4.2 The effects of contrastive focus on f0, duration, and intensity patterns
Contrastive focus is prosodically encoded through the global adjustment of
f0, duration, and intensity of the whole sentence, which can be summarized
as “Tri-zone adjustments”, as Xu (1999) proposed for the prosodic encoding
of focus in Standard Chinese. In Shanghai Chinese, the contrastive focus
also has little effect on the f0, duration, and intensity patterns of the
pre-focus constituents, while it is phonetically realized mainly through
adjusting the f0, duration, and intensity patterns of focused and post-focus
With regard to the focused constituents, a contrastive focus substantially
enhances the f0 realization of the focused syllable by (mainly) raising the
maxf0 and (sometimes) lowering the minf0; it also significantly increases the
intensity and lengthens the duration of the focused syllable. It should be noted
that the f0 and intensity of both syllables within the compound word (S1+S2)
are always enhanced or increased together, although the contrastive focus was
only located on one syllable, either on S1 or S2. In contrast, in VP phrases,
only the f0 and intensity of the focused syllable are enhanced or increased by
contrastive focus (in F-S3 or F-S4 conditions). In other words, the adjustment
domain of f0 and intensity in a compound word is the whole compound, while
that in a VP phrase is only the focused syllable, which verifies that there is a
prosodic boundary between the two syllables of the VP phrase, while there is
no boundary within the compound word. That is, a VP phrase forms
a prosodic phrase, while a compound word forms a prosodic word
(Selkirk and Shen 1990). Such a result also indicates that the focus-induced
f0 and intensity adjustment are mediated through the prosodic structure in
Shanghai Chinese, which supports the indirect encoding analysis.
However, the focus-induced duration adjustment pattern differs from the
adjustment patterns of f0 and intensity: the durations of both syllables
within the compound word (i.e., S1 and S2) were significantly lengthened by
contrastive focus (in F-S1 and F-S2 conditions), and the durations of both
syllables in the VP phrase (i.e., S3 and S4) were likewise significantly
lengthened by focus (in F-S3 and F-S4 conditions), although the contrastive
focus was located on only one syllable. It seems that the duration
7.5 Conclusion
In this chapter, we examined the f0, duration, and intensity patterns of disyl-
labic compound words and VP phrases in Shanghai Chinese, and further
investigated the effects of contrastive focus on these patterns. The f0,
duration, and intensity patterns of modifier-noun compounds and verb-noun
phrases in the normal condition, and their different adjustments induced by
contrastive focus, confirmed that the application of left- or right-dominant
sandhi depends on the morphosyntactic structure, which is associated with
different prosodic structures. Furthermore, the prosodic encoding of
contrastive focus is mediated through prosodic structure in Shanghai Chinese.
Bates, D., Maechler, M., Bolker, B., and Walker, S. (2014). lme4: Linear mixed-effects
models using Eigen and S4. R package version 1.1–12.
Boersma, P., and Weenink, D. (2010). Praat: Doing phonetics by computer (Version
5.1.30) [Computer program]. Retrieved from
Bartels, C., and Kingston, J. (1994) “Salient pitch cues in the perception of contrastive
focus”, The Journal of the Acoustical Society of America, 95(5), p. 2973.
Baumann, S., Grice, M., and Steindamm, S. (2006) “Prosodic marking of focus
domains – categorical or gradient?” in Proceedings of speech prosody, Dresden,
Germany, May 2–5, 2006, pp. 301–304.
Beckman, M. E. (1986) Stress and non-stress accent. Netherlands Phonetic Archives
Series No. 7. Dordrecht: Foris.
Birner, B. (1994) “Information status and word order: An analysis of English inver-
sion”, Language, 70(2), pp. 233–259.
Breen, M., et al. (2010) “Acoustic correlates of information structure”, Language and
Cognitive Processes, 25(7), pp. 1044–1098.
Cambier-Langeveld, T., and Turk, A. (1999) “A cross-linguistic study of accentual
lengthening: Dutch vs. English”, Journal of Phonetics, 27, pp. 255–280.
Cao, J.-F., and Maddieson, I. (1992) “An exploration of phonation types in Wu dialects
of Chinese”, Journal of Phonetics, 20, pp. 77–92.
Chao, Y-R. (1967) “Contrastive aspects of the Wu dialects”, Language, 43, pp. 92–101.
Chen, M. (2000) Tone Sandhi. Cambridge: Cambridge University Press.
Chen, Y.-Y. (2005) “Durational adjustment under contrastive focus in Standard
Chinese”, Journal of Phonetics, 34, pp. 176–201.
Chen, Y.-Y. (2006) “Durational adjustment under corrective focus in Standard
Chinese”, Journal of Phonetics, 34, pp. 176–201.
Chen, Y.-Y. (2010) “Post-focus F0 compression – Now you see it, now you don’t”,
Journal of Phonetics, 38, pp. 517–525.
Chen, Y.-Y., and Gussenhoven, C. (2008) “Emphasis and tonal implementation in
Standard Chinese”, Journal of Phonetics, 36, pp. 724–746.
Chen, Y.-Y., and Gussenhoven, C. (2015) “Shanghai Chinese”, Journal of the
International Phonetic Association, 45, pp. 321–337.
Chen, Z.-M. (2014) “On the relationship between tones and initials of the dialects in
the Shanghai area.” Paper presented at the 4th International Symposium on Tonal
Aspects of Language (TAL-2014), Nijmegen, the Netherlands, 13–16 May 2014.
Part III
What kinds of processes are
And how powerful are they?
Ellen M. Kaisse1
8.1 Introduction
The literature on phonological processes that apply between content
words is full of cases like tone sandhi in Chinese languages, tone spread in
Bantu, the placement of intonational boundary tones, and local cases of
resyllabification, vowel deletion, voicing assimilation, or place assimilation
between the final segment of one word and the initial segment of the next.
But some kinds of processes are profoundly underrepresented. Vowel har-
mony rarely extends beyond the word or clitic group, and in the very few, less
familiar, cases reported to extend into the next full word, it often extends only
one syllable onward, not iterating so as to affect the whole word, as its lexical
counterparts do. Stress assignment almost always seems to be word bounded,
or clitic-group bounded at the extreme. Similarly, processes of consonant
harmony –the spread of a consonantal feature such as nasality, anteriority,
or pharyngealization –are typically bounded by the word. In this chapter,
I survey the processes in the phonological literature that have been described
as applying across content words. I will then speculate on why postlexical
application is so strongly skewed toward certain kinds of processes and not
others. Because a survey of postlexical rules perforce must cover a great deal
of ground, I will concentrate here on the question of vowel harmony, com-
paring and contrasting it with processes involving tone, but will include some
discussion of the other kinds of cases mentioned above. I will not be looking
at processes that include closed class, function morphemes that fall outside
the morphological word but that arguably lie within the same prosodic word –
hence my insistence here on operations between “full” or “content” words.
It is well known that function words, especially monosyllabic, closed-class
items, can behave very much like affixes for the purposes of stress rules and
vowel harmony, among many others, so they are not my focus. But later in
this chapter we will speculate on why such closed-class items can be available
to harmony and stress.
Much of the informal typology reported here comes from simple obser-
vation of the literature on postlexical processes, which I have followed
closely and to which I have contributed for several decades. Additionally,
The exact description of the environment for flapping is, of course, the sub-
ject of continuing controversy, but the idea that it is related to resyllabification
is not. Similar examples can be found in dozens of languages, including
Spanish and French (despite the synchronically deeply complicated status of
(4) ʧoʤuk-lar-ɨm=lɑ=mɨ=dɨr
‘is it with my children?’
(5) ʧiʧek-ler-im=le=mi=dir
‘is it with my flowers?’
(8) pʊ ɩl kɐˈmɩnʊ
along the path
In these two languages, harmony does not apply between the members of a
compound word, let alone between independent content words, and this gener-
ally seems to be the case in those languages I have surveyed. However, Tibetan
(Dawson 1980) is reported to have productive Advanced Tongue Root (ATR)
height harmony within compounds. My speculation, which will find further
support in the independent word cases below, is that because compounds are
fixed, lexicalized phrases –some more so than others, of course –and there-
fore contain words that are in frequent collocation, they provide the next most
hospitable environment for vowel harmony to be phonologized.
An interesting case of a statistical tendency toward vowel harmony within
compounds in Turkish was discovered by Martin (2007). More Turkish
compounds have harmony-obeying members than would be expected from a
chance distribution. Based on this case and several others, Martin hypothesizes
that when speakers retrieve words (including compound words), there is a
statistical preference to retrieve those that accord with the phonotactic
generalizations present in the language. Therefore, more Turkish compounds
that accord with vowel harmony are retrieved and subsequently lexicalized.
Such distributions can only be found easily with tools earlier generations
of linguists did not have, since counting compound words in the lexicon is
prohibitively time-consuming. In any case, these
generalizations about compounds seem to be very rarely phonologized. We may find
many more statistically imbalanced cases like Martin’s, but I do not think we
are faced with significant underreporting of obligatory or even optional vowel
harmony within compounds in the world’s languages.
While the process within Akan words is iterative and bidirectional (spreading
+ATR from stems to prefixes and suffixes), the one between words is local
and regressive. In other words, the postlexical process is less powerful than its
lexical counterpart.
Archangeli and Pulleyblank (2002) note that the output of the rule is also
gradient in the sense that the leftmost derived +ATR vowel is not quite as
advanced as a canonically +ATR vowel would be.
Mutaka (personal communication 2015) also intuits that the rule can
extend onto the nearest non-low vowel in a verb when the following noun
object is +ATR. In this case the rule is again optional and would not apply
in very deliberate speech. The spread seems to be slightly more limited than
in the noun + adjective cases. Thus in (14), the /ʊ/ of the verb stem /sʊŋ/ can
be realized as [suŋ] under the influence of the following +ATR noun, but no
advancement occurs on the preceding syllables [mɔ.tʊ.ka].4
But word plus clitic seems to be about as big as the domain gets. Foot con-
struction does not cross content word boundaries. We don’t expect to find
cases like the fanciful examples below from a language like English, but where
trochaic feet are built from left to right, taking in whole phrases, and thereby
creating wholesale allomorphy in content words depending on the length and
stress pattern of surrounding content words:
The process does not generally cross word boundaries, as shown in (20),
where, since Unbounded Tone Spread is inapplicable due to the first word not
being phrase-final, a rule of Bounded Tone Spread applies exactly two tone-
bearing units rightward.
However, there is a high tone spread rule which does apply between words.
Bickmore and Kula name this Inter-Word Doubling or Binary Spread.
It spreads a word-final high tone onto the initial tone bearing unit of a
following word.
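The local character of this rule can be made concrete with a minimal sketch. It is an illustration only, not Bickmore and Kula's formalization: the representation of words as lists of tones, the symbol "Ø" for a toneless unit, and the function name are all assumptions introduced here, as is the choice to read the trigger from the input so that the spread does not iterate.

```python
def inter_word_doubling(words):
    """Spread a word-final H onto the first tone-bearing unit of the next word.

    `words` is a list of words, each a list of tones: "H" or "Ø" (toneless).
    Hypothetical encoding, chosen only for illustration.
    """
    out = [list(w) for w in words]  # copy so the input is left untouched
    for left, right in zip(words, out[1:]):
        # The trigger is read from the input form, so a newly derived H
        # does not spread further (assumption: the rule is non-iterative).
        if left and right and left[-1] == "H" and right[0] == "Ø":
            right[0] = "H"
    return out
```

On this encoding, a word-final H affects exactly one unit of the following word, which is the sense in which the postlexical rule is less powerful than Unbounded Tone Spread.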
While more than one tone in a sequence has ultimately been changed, the
influence is only from one tone to an adjacent tone. And in many cases of
tone sandhi in the Chinese languages treated by Chen, the effect is purely local
within the foot, with no feeding as in the Tianjin case.
While, as we shall see in the next section, tonal processes can grammaticize
so as to span several syllables or even several words, Copperbelt Bemba and
many Chinese tone sandhi processes indicate that such powerful application
is not universal, nor perhaps even typical.
The many Bantu cases in the literature of which I am aware are phrase-
bounded –that is, they apply in a prosodic domain that is close to isomorphic
with a syntactic phrase: between a verb and its object for instance (Luganda
Low Tone Deletion, Hyman et al. 1987), any head and its complement
(Xitsonga), and so forth. As such, the examples one finds are typically only
two words long, though they may involve many syllables. Here are some of
the many Bantu examples one could mention: in Shambala, a Southern Bantu
language of Tanzania (Philippson 1998) High tone spreads to the penult of
the following word, regardless of its length. In Tiriki, a Southern Bantu lan-
guage of Kenya (Paster and Kim 2011), High tone spreads to all the toneless
tone-bearing units of the preceding word. In Logoori, yet another Southern
Bantu language of Kenya (M. Paster and D. Odden personal communication
2015; fieldwork in progress), High tone spreads similarly to the way it spreads
in Xitsonga, but the phonetic repercussions in Logoori are not easy to hear
if one does not know to look for them. Here we have the inverse of under-
reporting, in the sense that Bantuists know from comparative tonology within
the family that there is likely to be some reflex of tonal spread in the language
they are investigating and therefore are able to find subtle cases that a non-
Bantuist might fail to recognize.
Turning to cases that are not from languages even distantly related to
Bantu, let us look at a low tone deletion rule in Peñoles Mixtec (Otomanguean;
Daly and Hyman 2007). This is a fairly spectacular case because it can involve
tonal triggers and targets in non-adjacent words. Daly and Hyman argue that
Peñoles Mixtec has underlying L and H tones, while the third tone, often
treated as underlying Mid, is really a default Ø tone that is invisible to phono-
logical processes. The tone deletion rule, called OCP(L), deletes the second L
of a L-Ø*-L sequence (where Ø* indicates any number of toneless syllables)
sometimes across many words and many syllables. Daly and Hyman’s longest
example (their (13b); (25) below) shows twelve intervening toneless tone-
bearing units between the L’s. There are three words –including an inflected
verb –between the word containing the trigger L (the first syllable of [dìi-ni-
kʷe-ʃi]) and the word containing the target L, [tʃìu]. The syllables containing
the trigger and target are underlined below.
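The long-distance character of OCP(L) as described above can be sketched as follows. This is a rough illustration rather than Daly and Hyman's own formulation: the flat list of tones, the symbols, and the function name are assumptions, as are two decisions the text does not settle, namely that a deleted L surfaces as Ø and that the trigger L stays active for any later L.

```python
def ocp_l(tones):
    """Delete the second L of every L-Ø*-L sequence.

    `tones` is a flat list over tone-bearing units, ignoring word
    boundaries: "L", "H", or "Ø" (toneless). Hypothetical encoding.
    """
    out = list(tones)
    last = None  # most recent specified (non-Ø) tone seen so far
    for i, t in enumerate(out):
        if t == "Ø":
            continue  # toneless units are invisible to the rule
        if t == "L" and last == "L":
            # Target L is deleted; the trigger remains active (assumption).
            out[i] = "Ø"
        else:
            last = t
    return out
```

Because toneless units are simply skipped, the number of intervening syllables or words is irrelevant to the rule, whereas an intervening H resets the search.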
8.6 Conclusion
Almost any kind of local process can apply between words. Adjacent
consonants can undergo assimilation across a word boundary, adjacent
vowels can undergo deletion or gliding, and final consonants can be recruited
as onsets to the next, vowel-initial word. This exuberance of types is prob-
ably due to the fact that most phonologized processes start life as natural
local effects and these effects are not sensitive to grammatical information
but rather to temporal adjacency (Kiparsky 1982 et seq.). Apparently, it is
not difficult for such effects to be grammaticized, though they may remain
optional in the sense that they are most likely to apply in rapid, unguarded,
or informal speech. Iterative processes, such as vowel harmony, consonant
harmony, and metrical stress assignment, however, are more phonologized.
Most vowel harmony rules are word-bounded, though ‘word’ may be defined
phonologically rather than morphologically, so that syntactically independent
function words and clitics may be included in their domain. I ascribed the
rarity of postlexical vowel harmony to a variety of factors. The phonetic
precursors for iteration become weaker the farther one gets from the trigger
vowel; content words are not in frequent collocation with one another so
there are few exemplars to lead to phonologization of these weak effects;
and, from an information-sparing perspective, fully assimilating a contrastive
1 I am grateful to Hamed Al-Tairi, Ryan Bennett, Gunnar Hansson, Sharon Hargus,
Beth Hume, Larry Hyman, Nancy Kula, Andrew Livingston, Dan McCloy, Laura
McGarrity, Philip Mutaka, Andrew Nevins, David Odden, Douglas Pulleyblank,
Stephanie Shih, Richard Wright, and audiences at the Annual Meeting on
Phonology (University of British Columbia, Vancouver 2015) and the First
/n̋ kɑ́ sū zɔ̋ zi̋ zò/→ [n̋ kʌ́‿sū ző‿zi̋ zò] Kaye p 123
I FUT tree under hide ‘I will hide under a tree’
4 Only high vowels are distinctively +ATR and –ATR in Kinande. Other vowels are
underlyingly –ATR only.
5 Odden does not mention the dialect of her speaker.
6 Barnes (2006) cites Inkelas et al. (2001) for evidence that phonetic vowel-to-vowel
coarticulation is problematic as a simple, unaided source for vowel harmony.
Inkelas et al.’s argument comes from Turkish, where anticipatory phonetic effects
are stronger than perseveratory ones, but the phonologized harmony system is
perseveratory. He instead attributes the phonologization of vowel harmony to
vowel-to-vowel coarticulation coupled with lengthening of the trigger syllable and
paradigm uniformity effects allowing longer-distance effects on distant affixes.
7 A particularly nice instantiation of this collocation effect can be found in the work
of Côté (2013) on liaison in Laurentian French. Côté uses transition probabilities
between various invariant words (such as adverbs and prepositions) and following
parts of speech to predict the likelihood that a liaison consonant will appear
between the first word and the second word.
8 Hamed Al-Tairi (personal communication 2015) informs me that Bukshaisha (1985,
cited in Habis (1998)) reports that in Qatari Arabic, emphasis spread, a type of
consonant-to-vowel or consonant-to-consonant harmony, is not constrained to one
single word and can cross word boundaries. I have thus far been unable to obtain
a copy of Bukshaisha’s dissertation, but Habis’ examples and summary (p. 167ff)
suggest that the Qatari example is typically postlexical in showing a fading effect
on F2 from the triggering consonant to the emphasized vowel, extending about 600
msec. All the examples cited by Habis affect only the nearest vowel in an adjacent
word, similar to the postlexical vowel harmony cases considered in the previous
section, but Habis cites only monosyllabic target words. This case is clearly in need
of further consideration.
Andrzejewski, B. W. (1955) “The problem of vowel representation in the Isaaq dialect
of Somali”, Bulletin of the School of Oriental and African Studies, 17(3), pp. 567–580.
Archangeli, D., and Pulleyblank, D. (2002) “Kinande vowel harmony: Domains,
grounded conditions and one-sided alignment”, Phonology, 19(2), pp. 139–188.
Match Theory and prosodic
well-formedness constraints
Junko Ito and Armin Mester
9.1 Introduction1
Several strands of work in prosodic theory have recently converged around a
number of common themes, from different directions. Selkirk (2009) (see also
Elfner 2012) has developed a vastly simplified approach to the syntax-prosody
mapping that distinguishes only three levels (word, phrase, and clause), and
syntactic constituents are systematically made to correspond to phonological
domains (Match Theory). In an independent line of research, a long string
of papers reaching back into the 1980s has convincingly demonstrated that
recursive structures are by no means an exclusive property of syntax, but also
play a crucial role in phonology. Even though at variance with strict layering
(Selkirk 1984; Nespor and Vogel 1986), the empirical existence of recursive
prosody is undeniable, as first demonstrated by Ladd (1986, 1988), whose
findings have been corroborated by Kubozono (1989, 1993), Schreuder and
Gilbers (2004), Gussenhoven (2005), Wagner (2005, 2010), Schreuder (2006),
Kabak and Revithiadou (2009), Ito and Mester (2009b), Féry (2010), and
van der Hulst (2010), to name a few, undermining a central tenet of orthodox
prosodic hierarchy theory that supposedly sets phonology apart from syntax.
Building on these empirical findings, Ito and Mester (2007, 2009a, 2013) have
gone on to argue that, beyond its sheer existence, prosodic recursion allows
for a vast, and much-needed, simplification in the inventory of prosodic cat-
egories themselves. The empirically necessary subcategories that the data of
individual languages often seem to demand (such as the minor versus the
major phonological phrase of Japanese, long established under these names
since McCawley (1968), and rechristened into “accentual” versus “inter-
mediate phrase” by Pierrehumbert and Beckman (1988)) are not separate cat-
egories, each existing on its own in some (or all?) language(s), but are rather
instances of a single recursively deployed basic category. These results are
very much in harmony with central ideas in Match Theory, and recent work
(Selkirk 2011; Ishihara 2014) has successfully connected the two theories into
a larger framework.
One of the hallmarks of Match Theory is the idea that the main force inter-
fering with syntax-prosody isomorphism is not some kind of non-isomorphic
Ranked as in tableau (3),4 these constraints derive the different parses in (1).
The syntactic pattern [[u]x], where x = a or u, is parsed as the non-isomorphic
single φ (3ae) (ux), violating bottom-ranked Match-XP but satisfying higher-
ranked BinMin in the optimal way. However, [[a]x] is parsed as (3im) ((a)(x)),
violating BinMin: Isomorphic (3l) ((a)a) is out because the second a violates
(3) Japanese with NoLapse
[[u]u] a. ► (uu) *
b. (u(u)) *W *
c. ((u)u) *W L
d. ((u)(u)) **W L
[[u]a] e. ► (ua) *
f. (u(a)) *W *
g. ((u)a) *W *W L
h. ((u)(a)) **W L
[[a]a] i. ► ((a)(a)) **
j. (aa) *W L *W
k. (a(a)) *W *L *W
l. ((a)a) *W *L
[[a]u] m. ► ((a)(u)) **
n. (au) *W L *W
o. (a(u)) *W *L *W
p. ((a)u) *W *L
The most interesting case is [[a]u] parsed as (3m) ((a)(u)), with a rise on u,
as depicted in (4d). The main tonal events in these examples are indicated with
schematic pitch contours.
a. =(3a) ( u u )
d. =(3m) (( a )( u ))
Japanese with
Align-R and EqualSis
[[u]u] a. ► (uu) *
b. (u(u)) *W *W *
c. ((u)u) *W *W L
d. ((u)(u)) **W L
Japanese with
Align-R and EqualSis
[[u]a] e. ► (ua) *
f. (u(a)) *W *W
g. ((u)a) *W *W *W *W L
h. ((u)(a)) **W L
[[a]a] i. ► ((a)(a)) **
j. (aa) *W *W *L *W
k. (a(a)) *W *W *W L
l. ((a)a) *W *W *W *L
[[a]u] m. ► ((a)(u)) **
n. (au) *W L *W
o. (a(u)) *W *W *L *W
p. ((a)u) *W *L
They derive the different outcomes in Japanese and Basque by the ranking
scenario in (7), with the result shown in (8).
[[u]u] a. ► (uu) *
b. (u(u)) *W *L *
c. ((u)u) *W *L W
d. ((u)(u)) **W W
Basque with Align-R
and EqualSis
[[u]a] e. ► (ua) *
f. (u(a)) *W *W *
g. ((u)a) *W *W *W *W L
h. ((u)(a)) **W L
[[a]a] i. ► ((a)(a)) **
j. (aa) *W *W L *W
k. (a(a)) *W *W *L *W *W
l. ((a)a) *W *W *L *W
[[a]u] m. ► ((a)u) * *
n. (au) *W L L *W
o. (a(u)) *W * * *W
p. ((a)(u)) **W L
In our own analysis with NoLapse instead of Align-R and EqualSisters, the
Basque system emerges when NoLapse ranks below BinMin, as shown in (9).7
[[u]u] a. ► (uu) *
b. (u(u)) *W *
c. ((u)u) *W L
d. ((u)(u)) **W L
[[u]a] e. ► (ua) *
f. (u(a)) *W *
g. ((u)a) *W *W L
h. ((u)(a)) **W L
[[a]a] i. ► ((a)(a)) **
j. (aa) *W L *L
k. ((a)a) *W *L
l. (a(a)) *W *L *W
[[a]u] m. ► (au) * *
n. (a(u)) *W L *
o. ((a)u) *W * L
p. ((a)(u)) **W L L
a. [[[u]u]u] ((uu) u) ((uu) (u)) W L
b. [[[u]a]u] ((ua) (u)) ((ua) u) L W
a. [[[u]u]u] ((↑u u) u) ((↑ u u)(↑u)) W L
b. [[[a]u]u] (((↑a↓)(↑u))(↑u)) (((↑a↓)(↑u)) u) L W
As we consider the two pairs, a way out might suggest itself since the two
cases differ in terms of the severity of sister inequality. This becomes clearer
when we inspect more detailed representations, with explicit indications of
projection levels, as in (15).
(15) [Tree diagrams not recoverable: winner and loser parses for (a) u u u and (b) a u u, with nodes annotated for projection level (0, 1, 2).]
Whereas the winner in (15a) has a pair of sister nodes (ω, φ0), the loser
in (15b) has a pair (ω, φ1). EqualSisters theorists might seize on this diffe-
rence and expand the constraint, in the familiar OT manner, into a family
of constraints penalizing sister inequality of different degrees of severity.
Let us assume, for concreteness, that besides the general EqualSisters con-
straint (5) penalizing any difference in category between sister nodes, there is
a more stringent constraint penalizing a situation where a category inequality
is aggravated by a concomitant projection level inequality. We might call
the more stringent constraint EqualSisters-2, violated when λj is sister to
κi, with λ>κ and j>i. Ranked above BinMin, EqualSisters-2 removes the
problem, as (16b) shows.
a. [[[u]u]u] ((uu) u) ((uu) (u)) W L
b. [[[a]u]u] (((a)(u)) (u)) (((a)(u)) u) W L W
c. [[[u]a]u] ((ua) (u)) ((ua) u) W L W
d. [[[u]a]a] ((ua) (a)) ((ua) a) W L W
e. [[[u]u]a] ((uu) (a)) ((uu) a) W L W
f. [[[a]a]a] (((a)(a)) (a)) (((a)(a)) a) W W L L
g. [[[a]u]a] (((a)(u)) (a)) (((a)(u)) a) W W L L
h. [[[a]a]u] (((a)(a)) (u)) (((a)(a)) u) W W L L
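The difference between the general EqualSisters constraint and the hypothetical EqualSisters-2 can be stated as a small sketch. The encoding is an assumption made for illustration: categories are ranked ω < φ < ι, each node carries a projection level, and a bare ω is taken to have level 0. Neither function is part of any published formalization.

```python
from dataclasses import dataclass

RANK = {"ω": 0, "φ": 1, "ι": 2}  # prosodic categories, lowest to highest


@dataclass(frozen=True)
class Node:
    cat: str    # "ω", "φ", or "ι"
    level: int  # projection level, 0 = minimal (assumed for bare ω)


def equal_sisters(a, b):
    """Return 1 if the two sisters differ in category, else 0."""
    return int(a.cat != b.cat)


def equal_sisters_2(a, b):
    """Return 1 if one sister is higher in category AND in projection level."""
    hi, lo = (a, b) if RANK[a.cat] > RANK[b.cat] else (b, a)
    return int(RANK[hi.cat] > RANK[lo.cat] and hi.level > lo.level)
```

Under these assumptions, the sister pair (ω, φ0) violates only the general constraint, while the pair (ω, φ1) violates both, which is the distinction the text appeals to.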
The evidence for these strictly binary prosodic parses (due to Kubozono
1989, 1993) are the initial rises (marked by up-arrows) in every φ and, in
the case of accented sequences, the extra rhythmic boost before the third
ω (indicated by the larger up-arrow). As shown earlier in (3), because of
undominated NoLapse-L and AccentAsHead, there are only four licit 2ω-
structures: (uu), (ua), ((a)(a)), and ((a)(u)), and joined into 4ω-structures they
yield the 4×4=16 combinatorial possibilities depicted in (19).
(19) ( uu ) ( uu )
( ua ) ( ua )
(( a )( a )) (( a )( a ))
(( a )( u )) (( a )( u ))
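The count of sixteen follows directly from pairing each of the four licit 2ω-structures with each of the four. A minimal sketch of the enumeration is given here; the string notation for parses follows the text, while the function name and the decision to wrap each pair in one outer φ are assumptions made for illustration.

```python
from itertools import product

# The four licit two-word structures named in the text.
LICIT_2W = ["(uu)", "(ua)", "((a)(a))", "((a)(u))"]


def four_word_structures():
    """Join every ordered pair of licit 2ω-structures under one outer φ."""
    return ["(" + left + right + ")" for left, right in product(LICIT_2W, repeat=2)]
```

The list begins with ((uu)(uu)) and includes fully accented (((a)(a))((a)(a))).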
Why ((12)(34)), rather than the more closely matching (((12)3)4)? Ishihara
(2014), following up on an informal suggestion in Selkirk (2011), gives an
explicit OT analysis summarized in (20).
S: P:
S: P:
[[[[u]u] u] a. ► ((u u) u) * *
b. (u u u) * **W
c. ((u u)(u)) * *W *
The following tableau anticipates this point and shows that BinMax-φ/ω
does not distinguish between (23a) and (23d), but BinMaxBranch-φ does.
On the other hand, BinMaxBranch-φ does not distinguish between (23a-c).
S: P:
S: P:
[[[[u]u] u] u] a. ► ((uu)(uu)) * * **
b. (uu) (uu) *W L * ***W
c. (((uu)u)u) **W L *L
[[[[a]a] a] a] d. ► (((a)(a))((a)(a))) * * **** **
e. ((a)(a)) ((a)(a)) *W L * **** ***W
f. ((((a)(a))(a))(a)) **W L **** *L
(27) Match-XP[+max]
S: P:
[[[a]u]u] a. ► (((a)(u))(u)) 1 3
b. (((a)(u))u) 1 1W 2L 1W
c. ((a)(uu)) 1 1W 1L 1W
d. ((a(u))(u)) 1W 1 2L 1W 1W
e. ((au)u) 1W 1 L 1W 1W
S: P:
[[[[u]u]u]u] ► ((uu)(uu)) 1 1 2
(((u)u)(uu)) 1 1 1W 1L 1W
(((uu)u)u) 2W L 1W 1L 2W
(uuuu) 1W 1 L 3W
(uu)(uu) 1W L 1 3W
The only crucial difference between the two systems is the ranking of
NoLapse-L and BinMin-φ:
prosody mapping
syntax-prosody mapping
1 Part of this research was presented at the 1st International Conference on Prosodic
Studies (ICPS-1): Challenges and Prospects, June 2015, Tianjin, China, where we
benefited from fruitful discussions with many conference participants, in particular,
Carlos Gussenhoven, Ellen Kaisse, Chi-Lin Shih, Irene Vogel, and Hongming
Zhang. We are grateful to Shin Ishihara, Sara Myrberg, and Alan Prince for pro-
ductive discussions of many of the issues dealt with in this chapter. Special thanks to
the 2015 syntax-prosody proseminar participants at UC Santa Cruz, where the core
of the analysis was developed in discussions with Jeff Adler, Jenny Bellik, Steven
Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York:
Harper & Row.
Elfner, E. (2012) Syntax-prosody interactions in Irish. Ph.D. Diss. University of
Massachusetts, Amherst.
Elordieta, G. (2007) “Minimum size constraints on intermediate phrases”. Retrieved
15 August 2015,
Féry, C. (2010) “Recursion in prosodic structure”, Phonological Studies, 13, pp. 51–60.
Gussenhoven, C. (2004) The phonology of tone and intonation. Cambridge: Cambridge
University Press.
Gussenhoven, C. (2005) “Procliticized phonological phrases in English: Evidence from
rhythm”, Studia Linguistica, 59(2/3), pp. 174–193.
Hulst, H. van der. (2010) “A note on recursion in phonology” in Hulst, H. van der (ed.)
Recursion and human language. Berlin: Mouton de Gruyter, pp. 301–342.
Ishihara, S. (2014) “Match Theory and the recursivity problem” in Kawahara, S., and
Igarashi, M. (eds.) Proceedings of FAJL 7: Formal Approaches to Japanese Linguistics.
MIT Working Papers in Linguistics 73. Cambridge, MA: MIT Press, pp. 69–88.
Prosodic studies of two
Chinese dialects
Hongming Zhang
10.1 Introduction
The nature of the syntax-phonology interface is of crucial importance
to prosodic phonology, and it involves two fundamental problems: (i) how
accessible is syntactic information to phonological processes, and (ii) what
grammatical properties are relevant to phonology? Research on the
syntax-phonology interface falls into two phases: the phase before
Optimality Theory (hereafter OT) and the phase after OT. This chapter
discusses some interface issues through case studies of Xiamen and Pingyao,
two dialects of Chinese. It argues that OT, resorting to brute force or
ad hoc constraints, fails to capture the nature of tone sandhi (hereafter TS)
in both Xiamen and Pingyao, and that interface theory within the OT framework
has no more explanatory power than the theories proposed before the OT era.
Section 10.2 presents the theoretical background, Sections 10.3 and 10.4
present the case studies of Xiamen Chinese and Pingyao Chinese, respectively,
Section 10.5 summarizes the discussion, and Section 10.6 offers the conclusion.
(1) a. ALIGN (X⁰, L; ω, L)
    b. ALIGN (X⁰, R; ω, R)
(2) a. ALIGN (XP, L; φ, L)
    b. ALIGN (XP, R; φ, R)
Condition (1) indicates that each lexical word needs to have its left or right
edge align with the left or right edge of the prosodic word. Constraints in
(2) state that the right or left edge of any XP in the morphosyntactic struc-
ture coincides with the right or left edge of some phonological phrase in the
prosodic structure.
Align constraints require that the edge of syntactic structure matches that
of the prosodic category. They target a group of conditions, thus covering
different levels of the mapping between syntax and phonology. These two
constraints were later referred to as “ALIGN-XP, L” and “ALIGN-XP, R” or
“ALIGN-XP” in Truckenbrodt (1995, 1999), asking that the right edge of a syn-
tactic phrase align with the right edge of a phonological phrase, and the left
edge of a syntactic phrase align with the left edge of a phonological phrase.
Alignment Theory is in effect the edge-based approach (Chen
1985, 1987; Selkirk 1986) restated in OT terms.
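How an edge-alignment constraint of this kind assigns violations can be illustrated with a minimal sketch. The representation is an assumption introduced here: syntactic phrases and phonological phrases are encoded as half-open (start, end) spans over word positions, and one violation is counted for each XP whose designated edge coincides with no φ edge. The function name is hypothetical.

```python
def align_violations(xps, phis, edge):
    """Count XPs whose `edge` ("L" or "R") coincides with no φ edge.

    `xps` and `phis` are lists of (start, end) spans over word positions.
    """
    i = 0 if edge == "L" else 1          # which end of the span to compare
    phi_edges = {p[i] for p in phis}     # all φ edges on that side
    return sum(1 for xp in xps if xp[i] not in phi_edges)
```

For instance, with two XPs spanning (0, 2) and (2, 5) and a single φ spanning (0, 5), the first XP violates ALIGN-XP, R and the second violates ALIGN-XP, L; parsing each XP as its own φ removes both violations.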
Another theory related to the study of interface is the Wrapping Theory
within the OT framework (Truckenbrodt 1999). It proposes wrapping each
syntactic phrase within a phonological phrase: Wrap (XP; φ). Alignment
and wrapping constraints are both typical constraints on the interface
between syntax and phonology. In addition, there is Match Theory (Selkirk
2006, 2009, 2011), as given in (3).
e. Match Word
A word in a syntactic constituent structure must be matched by
a corresponding prosodic constituent, call it ω, in phonological
These match constraints call for the constituent structures of syntax and
phonology to correspond. This predicts a strong tendency for phonological
domains to mirror syntactic constituents. The view to be argued is that the
phonological constituent structure produced for individual sentences in individual
languages is the result of syntactic constituency-respecting match
constraints. Moreover, in identifying distinct prosodic constituent types
(ι, φ, ω) to correspond to the designated syntactic constituent types, the
Match Theory embodies the claim that the grammar allows the fundamental
syntactic distinctions between clause, phrase, and word to be reflected in the
phonological representation.
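The correspondence that the match constraints demand can be sketched in a few lines. The sketch rests on simplifying assumptions that are not part of Selkirk's definitions: constituents are encoded as (type, start, end) triples over word positions, and a syntactic constituent counts as matched only if a prosodic constituent of the corresponding type covers exactly the same span. The names are hypothetical.

```python
# Syntactic constituent types and the prosodic types they correspond to.
CORRESPONDS = {"clause": "ι", "phrase": "φ", "word": "ω"}


def match_violations(syntax, prosody):
    """Count syntactic constituents lacking an exactly matching prosodic one.

    Both arguments are sets of (type, start, end) triples.
    """
    return sum(
        1
        for (kind, start, end) in syntax
        if (CORRESPONDS[kind], start, end) not in prosody
    )
```

A fully isomorphic prosodic structure incurs no violations on this count; each syntactic phrase left without its own φ adds one.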
In addition to the constraints discussed in the OT framework, four general
constraints that are entailed by the Strict Layer Hypothesis are also proposed
(Selkirk 1995), as given in (4).
The Strict Layer Hypothesis is the well-formedness condition on the tree
diagram of the prosodic hierarchy. It determines the organization of the
prosodic hierarchical structure and constrains the prosodic constituents that serve
as domains for the application of phonological rules. The structural relationship
existing among different prosodic constituents has a direct bearing on the
construction principles of prosodic hierarchy.
Within the OT framework, there are many constraints for different types
of prosodic units. For instance, BinMin (φ, ω) requires that a phonological
phrase contain at least two prosodic words, while BinMax (φ, ω) demands
that a phonological phrase contain at most two prosodic words.
These two constraints can be summed up as follows in (5).
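As a rough illustration of the two size constraints just described (not a rendering of the chapter's own statement), the following sketch counts violations for a list of phonological phrases, each represented simply by the number of prosodic words it contains. The function names and the one-violation-per-phrase counting are assumptions.

```python
def bin_min(phrase_sizes):
    """One violation per φ containing fewer than two ω."""
    return sum(1 for n in phrase_sizes if n < 2)


def bin_max(phrase_sizes):
    """One violation per φ containing more than two ω."""
    return sum(1 for n in phrase_sizes if n > 2)
```

A parse into phrases of one, two, and three words thus incurs one violation of each constraint, while a parse into two binary phrases incurs none.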
The Align-Wrap Theory and the Match Theory reflect the most recent
development in the study of the interface between syntax and phonology.
They present different predictions for prosodic structures. When Wrap-XP,
Align-XP, and non-recursivity interact, they might predict different types
of relations between syntax and phonology (Truckenbrodt 1999). In a language
that ranks Non-recursivity high, for a VP with two internal arguments like
[NP NP V]VP, Wrap-XP and Align-XP can be applied to predict and derive
these three different prosodic structures, as seen below.
44 21
b. Checked Syllable:3
(i) 5 → 21 (-p, -t, -k)
        21 (-q)
(ii) 3 → 5 (-p, -t, -k)
         53 (-q)
In tableau (10), the AvP yi-king in candidates (b) and (d) has no
corresponding phonological phrase boundary on its right. Moreover, yi is a
DP, which belongs to a functional category, and the Lexical Category
Condition (LCC) (Truckenbrodt 1999) makes interface constraints apply only to
lexical items, not to functional ones. The absence of a phonological phrase
boundary on the right of the DP therefore violates no interface condition,
and both candidates satisfy Align-R. However, because the AvP yi-king
combines with the following verb tsau to form a VP, Wrap-XP requires this VP
to be contained in a single phonological phrase. Both candidates violate
Wrap-XP, which must be ranked higher, and are therefore eliminated. As for
candidates (a), (b), and (c), none of them has a phonological phrase boundary
to the right of the AvP yi-king, so each violates Align-R; but since all of
them satisfy Wrap-XP, they tie in the absence of other constraints.
Candidate (10e) violates the constraint and is eliminated because its yi is
left unparsed at the level of the phonological phrase. Each of the
phonological phrases listed violates the markedness constraint *P-phrase at
least once, and (10c) incurs more *P-phrase violations than (10a).
Therefore, (10a) emerges as the optimal form, since the whole string in
(10a) can be parsed as one phonological phrase. The non-recursivity
constraint cannot be violated in Xiamen TS. Among the candidates given below,
those that violate Recursivity and Exhaustivity and cause an unnecessary
increase in the number of phonological phrases have already been eliminated.
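The logic of elimination used in this discussion, in which a higher-ranked constraint decides before any lower-ranked one is consulted, can be sketched as a lexicographic comparison of violation profiles. The candidates and violation counts below are hypothetical and do not reproduce tableau (10); the ranking shown is likewise only an assumption for the purpose of the example.

```python
def evaluate(candidates, ranking):
    """Return the candidates that survive strict-domination evaluation.

    `candidates` maps a candidate name to a dict of violation counts;
    constraints missing from the dict count as zero violations.
    """
    def profile(name):
        # Violations listed in ranking order; tuples compare lexicographically.
        return tuple(candidates[name].get(c, 0) for c in ranking)

    best = min(profile(n) for n in candidates)
    return [n for n in candidates if profile(n) == best]


RANKING = ["Wrap-XP", "Align-R", "*P-phrase"]   # assumed ranking
CANDIDATES = {                                  # hypothetical counts
    "cand1": {"Align-R": 1, "*P-phrase": 1},
    "cand2": {"Align-R": 1, "*P-phrase": 2},
    "cand3": {"Wrap-XP": 1, "*P-phrase": 2},
}
```

With these toy profiles, `evaluate(CANDIDATES, RANKING)` leaves only cand1: cand3 is removed by Wrap-XP even though it satisfies Align-R, and the tie between cand1 and cand2 is broken by *P-phrase.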
Analyses of AvP as the adjunct of the sentences are given below.
Now, let us take a look at AvP as the adjunct of the sentence in (13).
As far as the TS group is concerned, the AvP used to modify the VP differs
from that used to modify the sentence. (a) in (12) and (13) are syntactic
structures, while (b) is a recursive prosodic structure, that is, a phonological
phrase. A syntactic phrase gets matched to a phonological phrase with its syn-
tactic DP and IP being eliminated by the Match condition (XP; φ).
The data of Xiamen TS can only help define the right edge of phonological
phrases, which means if a monosyllabic word keeps the form of its citation
tone unchanged in TS, the right edge of this word will be the right edge of
the phonological phrase. However, Xiamen TS cannot define the left edge
of phonological phrases. While Match Theory requires that the phonological
The TG formation in (14) not only points out that Xiamen TS depends on
functional categories, but also combines two different approaches, namely,
the end-based approach proposed by Selkirk (1986) and the relation-based
(19) IP
VP-adjunct V NP
(Note: V m-commands AP.)
(20) IP
sentential-adjunct V NP
(Note: V does not m-command AP.)
(22) IP
Compared with the preliminary version in (14), the revised version in (23)
also considers that functional relations with the head, instead of m-command,
are the key to the Xiamen TS. Different from (14), (23) emphasizes that the
adjunct only c-commands its lexical heads, not all of its heads. Since a sen-
tential adjunct is licensed by I (Infl), which is the head of a functional cat-
egory, it is a non-lexical head; thus, the TS rule must be blocked between a
sentential adjunct and its following elements, although the sentential adjunct
c-commands its following elements. But the adjuncts within the VP and NP
are different because both of them modify lexical heads, and, thus, the TS rule
must be applied between adjuncts and their heads. As for the cases in which
the TS rule must be blocked between the PP and the closely following verb,
according to Chen (1992), the NP (i.e., the XP between the P and verb) is an
argument rather than an adjunct, although the PP is the adjunct of the verb,
thus blocking the TS rule, as seen in (24).
(24) VP
[P [NP]ARG # ]adjunct V NP
Thus, it can be seen that the revised version in (23) by Chen not only solves the
problem in (16) and (17) but also works out a solution for the problem in (21).
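As a rough illustration of the boundary assignment described for (23), the sketch below marks the right edge of an XP with ‘=’ when the XP is an adjunct related to a lexical head, and with ‘#’ otherwise. The class `XP`, the function `right_edge_marker`, and the set `LEXICAL_HEADS` are hypothetical names introduced only for this example. Treating V and N as the lexical heads is an assumption based on the VP and NP adjunct cases discussed above; the sketch is not Chen's own formalization.

```python
from dataclasses import dataclass

# Assumption: V and N count as lexical heads; I (Infl) is functional.
LEXICAL_HEADS = {"V", "N"}


@dataclass
class XP:
    label: str          # descriptive name, for readability only
    is_adjunct: bool    # True if the XP is an adjunct, False if an argument
    head_category: str  # category of the head the XP is related to


def right_edge_marker(xp: XP) -> str:
    """Return '=' (TS applies) or '#' (tone group boundary, TS blocked)."""
    if xp.is_adjunct and xp.head_category in LEXICAL_HEADS:
        return "="
    return "#"


vp_adjunct = XP("VP-adjunct", is_adjunct=True, head_category="V")
sentential_adjunct = XP("sentential adjunct", is_adjunct=True, head_category="I")
np_argument = XP("NP inside PP", is_adjunct=False, head_category="V")

for xp in (vp_adjunct, sentential_adjunct, np_argument):
    print(xp.label, right_edge_marker(xp))
```

On this reading, the argument NP inside the PP in (24) receives ‘#’ simply because it is not an adjunct, whatever the category of the following head.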
Chen (1992) has conducted an analysis of case (25). In his opinion, the
adnominal adjunct QP in (15a) for the NP liok-yah-p’ih ‘video movie’, which
occupies an object position, is reanalyzed as an adverbial phrase as well as a
(31) S'
Top S
V' S'
The first question we want to ask is how ‘=’, which is put at the right edge
of the QP to symbolize the application of the TS rule, is obtained. According
to the TG formation in (23), ‘#’ should be assigned to the right edge of all
XPs, except when an XP is an adjunct c-commanding its lexical head, for
which an ‘=’ should be put there instead. But in (31), the QP c-commands
only the verb tso ‘rent’ at its left without c-commanding any elements to its
right. By Chen’s analysis, the QP seems to be an adjunct c-commanding its left
head, thus gaining an ‘=’ at its right, although this QP does not have any c-
command relation with its right elements. Such an analysis is also suitable for
example (26b). But this analysis violates the locality conditions (Poser 1981,
1985; Steriade 1987), which maintain that the application of the TS rule to
the right should have nothing to do with the syntactic condition to the left.
The second question is concerned with “lexical head”. According to
(23), the TS rule must be blocked between the XP and the following elem-
ents, except when the XP is an adjunct c-commanding its lexical head. Before
discussing the problem involved in (23), let us briefly present Chinese phrase
structures first (Huang 1982, 1991; Tang 1990). In the notation of X’-theory,
every phrasal category is a projection of a zero-level category in terms of the
following formalization.6
(32) a. X’ = X X”*
b. X” = X”* X’
The TG formation in (33) can account for, without any exception, all of the
data mentioned above. Adjuncts in both example (16) and (17) m-command
their following heads, but since the head of the former is a verb while that of
the latter is an Infl, the TS rule can be applied only to (16), and is blocked in
(17), as shown respectively in (34) and (35).
(34) IP
yi-king = tsau
(Note: AP m-commands V, i.e., the head of VP.)
(35) IP
tai-k’ai # tsau
(Note: AP m-commands Infl, i.e., the head of IP.)
(36) IP
(37) IP
AP1 V'
Now, let us consider the examples (25–30) in accordance with the TG for-
mation in (33).
In both (25b) and (26b), the QPs, as adjuncts, m-command the right head
lai ‘to’. Likewise, in example (30), the QP m-commands hoo ‘for’, the head of
CP on the right. So the TS rule must be applied to (25b), (26b), and (30), in
which the heads following the QPs are all complementizers and are all heads
of CP. The syntactic structure of (30) can be repictured as (38).
(38) IP
Spec I'
Spec I'
As for examples (27b), (28b), and (29), their syntactic tree structures are the
same as illustrated in (39), in which the QP as an adjunct cannot m-command
any of the elements on its right, thus blocking the TS rule.
(39) IP
Spec I'
Therefore, it can be seen that the TG formation in (33) can account for all
of the data here.
(40) IP
Spec I'
I VP-shell
Spec V'
t [e] t
Thus, it can be seen that the definition in (33) differs from that in (23), in
that the former maintains that the TS rule is blocked by an empty category,
while the latter holds that it is blocked by functional words. However, the TS
rule is still applicable even if functional heads on the right are m-commanded
by an adjunct, and this has been proved by lai ‘to’ in (25b) and (26b) as well
as hoo ‘for’ in (30).
T1 / T2 LM LMq MH HM HMq
T1 / T2 LM LMq MH HM HMq
In the above two tables, the leftmost column and the top row show the
form of the citation tones of the first and the second syllable, respectively. The
intersections of the columns and the rows indicate the sandhi tone forms of
bi-tonal sequences.
The tones of LMq and HMq can be considered as the allotones of LM and
HM, respectively, because they have the same TS patterns. Thus, the patterns
of TSA can be simplified as (44).
T1 / T2 LM MH HM
The rules of TSB are more complicated. Besides regressive rules, progres-
sive rules and bidirectional rules will also be applied, as shown below.
(49)                 耕地               豇豆
                     ‘till soil’        ‘cowpea’
Functional type      argument           non-argument
Syntactic type       verb-object (VO)   modifier-noun (MH)
Tone sandhi type     type A (TSA)       type B (TSB)
Citation tone        LM - MH            LM - MH
Sandhi tone          ML - MH            LM - LM
ok MLq - MH - HMq
ii. L→R [ MLq ] by TSA
[ LM ] by TSA
* MLq - LM - HMq
iii. R → L [ HMq ] by TSB
[ LM ] by TSB
* LMq - LM - HMq
As shown in (50), only the cyclic mode will bring about the correct output
form. In the derivations above, labeled brackets […]A and […]B stand for func-
tional units of type A or type B, which select for TSA or TSB respectively on
each cycle. Some other examples, however, suggest a non-cyclic mode, seen as
(51) a
* MH - HM - LM
ii. R → L [ LM ] by TSA
[ LM ] by TSA
ok LM - LM - LM
* HM - LM - LM
ii. L → R [ NA ] by TSB
[ HM ] by TSB
ok HM - MH - HM
Apparently, in the cases of (50) and (51), the functional information for
internal structures is ignored. Moreover, TS rules apply iteratively, with the
functional relation holding on the outer structures that determine both the
applicable rule (TSA or TSB) and the direction of application (right to left or
left to right). Without going into the details, the overall patterns of Pingyao
TS can be laid out as follows.
x1 x2 x3 x1 x2 x3
--A-- --A--
--A -- --A--
(A3) A (A4) A
x1 x2 x3 x1 x2 x3
--A-- --B--
--A-- --A--
Type B Left-branching
(B1) B (B2) B
x1 x2 x3 x1 x2 x3
--A-- --B--
--B-- --B--
(B3) B (B4) B
x1 x2 x3 x1 x2 x3
--B-- --B--
--B-- --B--
Reg(2) H
Dur(B), Pres( 1, C)
Pres( 1, R)
Some of the constraints for TSA are also available to TSB, which are
Num(Inf) ≤ 2, Num(Inf) ≥ 1, Dur(B), Pre(σ1, C), Pre(σ1, R), and Pres(HM).
Nevertheless, the constraints specifically for TSB are given in (55).
Under this constraint, the three syllables will be parsed as either (σσ)σ or
σ(σσ). To prevent such structures with an unparsed syllable from being chosen,
a constraint that demands every syllable in the input be parsed into a TSD is
needed, as shown in (59).
When the Parse constraint is ranked higher than the Binary constraint, the
unparsed structures are ruled out.
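The effect of ranking Parse above Binary can be sketched with a small evaluator that compares candidates by strict domination. The function names and the string encoding of candidates are hypothetical. The reading of Binary used here (one mark for every domain that does not contain exactly two syllables) is an assumption made for illustration, and the alignment constraints are left out.

```python
def domains(parse: str) -> list[str]:
    """Return the material enclosed by each pair of matching parentheses."""
    stack, found = [], []
    for i, ch in enumerate(parse):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            start = stack.pop()
            found.append(parse[start + 1:i])
    return found


def parse_violations(parse: str) -> int:
    """Parse: one mark per syllable that lies outside every domain."""
    depth = marks = 0
    for ch in parse:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "σ" and depth == 0:
            marks += 1
    return marks


def binary_violations(parse: str) -> int:
    """Binary (assumed reading): one mark per domain without exactly two syllables."""
    return sum(1 for d in domains(parse) if d.count("σ") != 2)


def winner(candidates: list[str], ranking: list) -> str:
    """Strict domination: compare violation profiles constraint by constraint."""
    return min(candidates, key=lambda c: tuple(con(c) for con in ranking))


candidates = ["σ(σσ)", "(σσ)σ", "(σ(σσ))"]
print(winner(candidates, [parse_violations, binary_violations]))
print(winner(candidates, [binary_violations, parse_violations]))
```

With Parse ranked first, the fully parsed recursive candidate is selected. With the ranking reversed, a candidate with an unparsed syllable is selected instead, which is the outcome the ranking is meant to exclude.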
Chen (1990, 2000) discussed the directionality of TS rules for type A and
type B constructions: TS scans construction A right to left and scans
construction B left to right. If we redefine constructions A and B as corresponding
to the phonological phrase and the prosodic word, respectively, the directionality
of TS in Pingyao can be restated as follows: the TS rule scans a phonological
phrase from right to left and scans a prosodic word from left to right. The
corresponding alignment constraints can then be proposed under the OT framework, as stated in
(60) and (61), respectively.
(60) Align (TSD, φ’)R: The right edge of every TS domain is aligned with
the right edge of the maximal phonological phrase.
(61) Align (TSD, ω’)L: The left edge of every TS domain is aligned with
the left edge of the maximal phonological word.
Following Ito and Mester (2012), we can refer to the larger structure of
the tri-syllabic string as the maximal prosodic category. It should be noted
that these two alignment constraints are not dominated in the prosodic hier-
archy, and consequently, the ranking of constraints for the tri-tonal sandhi in
Pingyao is as follows.
(62) Align (TSD, ω’)L /Align (TSD, φ’)R >> Parse σ >> Binary
σ (σ σ) *!
(σ σ)φ σ *! *
(σ) (σ σ) *! *
(σ σ)φ (σ) *! *
(σ (σ σ))φ’ *
((σ σ)φ σ)φ’ *! *
The constraints and recursive prosodic structures proposed here can pre-
dict the TSD to account for all eight TS patterns listed in (52), through restruc-
turing. However, there are two problems in this analysis. The first problem
is the property of the alignment. Generally speaking, the term “alignment”
refers to the correspondence of different domains, that is, the correspondence
between morphosyntactic category and prosodic category. But if the Align-
L (i.e., the domain of TS; maximal prosodic word) adopted in the analyses
considers the domain of TS a prosodic unit, the alignment constraint here will
be a correspondence between prosodic units only, rather than between mor-
phosyntactic units and prosodic units.
Another problem is the different TS behaviors of the embedded disyllabic
units in the tri-syllables. Across the eight tri-syllabic patterns, the embedded
disyllabic unit shows different properties. In some patterns it forms a prosodic
word and undergoes TSB; in some it forms a phonological phrase and undergoes
TSA; and in others it forms no prosodic unit at all, so that the application
of the TS rule is decided by the property of the outer maximal prosodic unit.
This situation makes it difficult to define the embedded disyllabic unit
in the tri-syllables as a consistent unit in the prosodic hierarchy. So, the
OT approach apparently fails to capture the TS patterns in Pingyao.
Since the rule of TSA applies right to left, it takes the rightmost element
X3 as the dominant element, which then determines the mode of rule appli-
cation by virtue of the c-command condition; TSB works in the same way
as TSA, but in a different direction. As seen from the principle in (69), in
Pingyao a functional relation determines the type of TS rule (TSA versus
TSB), while a syntactic condition (c-command) determines the mode of TS
rule application. Now let us use the principle in (69) to test all of the patterns
illustrated in (52).
In both (A1) and (A2) of (52), TSA applies iteratively right to left because
X3 c-commands both X2 and X1, illustrated by (70a). In (A3) and (A4), since
X3 does not c-command X1, TSA and TSB apply cyclically, seen as (70b). In
(B1) and (B2), TSA/B applies cyclically because X1 does not c-command X3,
as shown in (70c). In (B3) and (B4), since X1 c-commands both X2 and X3,
TSB applies iteratively left to right, as presented in (70d).
(70) a. = (A2)
journey long
b. = (A4)
move bed-roll
‘to move bed-roll’
X1 - X2 - X3
[ LM ] by TSB
[ NA ] by TSA
ok LM - LM - LM (cycle)
c. = (B1)
d. = (B3)
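The way the c-command condition selects the mode of application can be sketched as follows. Trees are encoded as nested tuples, and the function names are hypothetical. The sketch assumes binary branching, distinct leaf labels, and the usual sister-based definition of c-command. It covers only the choice between iterative and cyclic application, not the tonal changes themselves.

```python
def leaves(tree):
    """Return the terminal elements of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def c_commanded(tree, target):
    """Leaves c-commanded by `target`: everything dominated by its sister."""
    if isinstance(tree, str):
        return None
    if target in tree:  # target is an immediate daughter of this node
        return [leaf for child in tree if child != target for leaf in leaves(child)]
    for child in tree:
        result = c_commanded(child, target)
        if result is not None:
            return result
    return None


def application_mode(tree, ts_type):
    """TSA takes the rightmost element as dominant, TSB the leftmost."""
    x1, x2, x3 = leaves(tree)
    dominant = x3 if ts_type == "A" else x1
    others = {x1, x2, x3} - {dominant}
    return "iterative" if others <= set(c_commanded(tree, dominant)) else "cyclic"


left_branching = (("X1", "X2"), "X3")
right_branching = ("X1", ("X2", "X3"))

print(application_mode(left_branching, "A"))
print(application_mode(right_branching, "A"))
print(application_mode(left_branching, "B"))
print(application_mode(right_branching, "B"))
```

Under these assumptions, a type A string is iterative only when X3 is the sister of [X1 X2], and a type B string is iterative only when X1 is the sister of [X2 X3], in line with the description of (A1) to (B4) above.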
The principle in (69) can explain all of the cases in (52), which shows that
Pingyao uses a typical functional/syntactic condition, rather than either a
foot condition as claimed by Chen (1990) or an OT case proposed by Zhang
10.5 Discussion
The domain of rule application of Chinese TS has been a major topic in
studies on the interface between syntax and phonology. With the birth of the
OT framework, the phonological study seems to be split into two opposing
paradigms: that is, rule-based phonology versus constraint-based phonology.
Likewise, the interface study of syntax-phonology also gets split into
two opposing paradigms, that is, the direct reference approach (DRA) and
the indirect reference approach (IRA). But these two oppositions are not the
same in nature. The former is caused by a different understanding about the
ontology, that is, how to interpret the nature of phonology. In other words, the
question here is whether the phonological process is a derivational process or a
constraint-ranking process. As for the latter, it reflects the controversy over such
issues as whether syntactic information is accessible to phonological processes,
what syntactic properties are relevant to phonology, whether phonological rule
application refers to syntactic information directly or indirectly, whether syntax
1 Here tone shapes are symbolized by a numerical notation, where 5 equals the
highest and 1 equals the lowest on a 5-point scale. The last two tones are restricted
to “checked” syllables, while the other five co-occur with “free” syllables.
2 T stands for base tone, T’ for sandhi tone, and α for sandhi domain.
3 -p, -t, -k, and -q here stand for the checked syllable, and -q for the glottal ending.
4 For a detailed discussion on the distinction between VP-adjunct and sentential
adjunct, see Tang (1990).
5 Here the symbol ‘#’ stands for the boundary between tone groups (TG), and
the TS rule is applied within TG but blocked across TG; the symbol ‘=’ is used
occasionally for highlighting the obligatory application of the TS rule at certain
junctions; and the letter ‘n’ for neutral tone.
6 In (32), where X* stands for zero or more occurrences of some maximal projection,
X is called a zero-bar projection, X’ a single-bar projection, and X” a double-bar
(or maximal) projection.
Chen, M. (1985) The syntax of Xiamen tone sandhi. MS, University of California
San Diego.
Chen, M. (1987) “The syntax of Xiamen tone sandhi”, Phonology Yearbook, 4, pp.
Chen, M. (1990) “What must phonology know about syntax?” in Inkelas, S., and Zec,
D. (eds.) The phonology–syntax connection. Chicago: University of Chicago Press,
pp. 19–46.
Chen, M. (1992) Argument vs. adjunct: Xiamen tone sandhi revisited. MS, University
of California San Diego.
Chen, M. (2000) Tone sandhi: Patterns across Chinese dialects. Cambridge: Cambridge
University Press.
Chen, M., and Zhang, H.-M. (1997) “Lexical and post-lexical tone sandhi in
Chongming” in Wang, J.-L., and Norval, S. (eds.) Studies in Chinese phonology,
vol. 1. Berlin: Mouton de Gruyter, pp. 13–52.
Cheng, R.-L. (1968) “Tone sandhi in Taiwanese”, Linguistics, 41, pp. 19–42.
Cheng, R.-L. (1973) “Some notes on tone sandhi in Taiwanese”, Linguistics, 100,
pp. 5–25.
Cheng, R.-L. (1991) “Interaction, modularization, and lexical diffusion: Tone sandhi
in Taiwanese verbs”. Paper presented at the 3rd North America Conference on
Chinese Linguistics, Ithaca, NY.
Chomsky, N. (1981) Lectures on government and binding. Dordrecht Holland: Foris.
Chomsky, N. (1986) Barriers. Cambridge, MA: MIT Press.
Chomsky, N. (1995) The minimalist program. Cambridge, MA: MIT Press.
Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York: Harper
and Row.
Chung, R.-F. (1989) Aspects of Ke-jia phonology. Ph.D. Diss., University of Illinois,
Duanmu, S. (1990) A formal study of syllable, tone, stress and domain in Chinese
Languages. Ph.D. Diss., Massachusetts Institute of Technology.
Part IV
Perceptual development of phonetic
categories in early infancy
Consonants, vowels, and lexical tones
Jun Gao and Rushen Shi
[Figure 11.1 plots, panels (a) and (b); vertical axis: F0 (Hz)]
Figure 11.1 (a) Pitch trajectories of example stimuli of Tone 2 and Tone 3. The broken
part in the mid-section of the Tone 3 pitch curve stands for creaky voice.
(b) Pitch trajectories of example stimuli of Tone 1 and Tone 4
[Figure 11.2 plot: looking times in Same and Different test trials for the T2–T3 and T1–T4 contrasts]
Figure 11.2 Results of both younger and older Mandarin-learning infants for the
Tone 2–Tone 3 (left two columns) and for Tone 1–Tone 4 (right two
columns) contrasts. Looking times (means and standard errors) were
significantly longer in Different than in Same test trials
Author notes
The experiment reported in this chapter formed part of the doctoral thesis
of the first author. The data were presented in the 2010 Speech Prosody and
2011 BUCLD meetings. This research was supported by grants from the
National Social Science Fund of China (Project No.: 08AYY02) and from the
Natural Sciences and Engineering Research Council of Canada (NSERC).
Corresponding authors for this article: Rushen Shi and Jun Gao.
Best, C. T. (1995) “A direct realist view of cross-language speech perception” in
Strange, W. (ed.) Speech perception and linguistic experience: Issues in cross-language
research. Timonium, MD: York Press, pp. 171–204.
Best, C. T., McRoberts, G. W., and Sithole, N. M. (1988) “Examination of
perceptual reorganization for nonnative speech contrasts: Zulu click discrimination by
English-speaking adults and infants”, Journal of Experimental Psychology: Human
Perception and Performance, 14(3), pp. 345–360.
Harrison, P. (2000) “Acquiring the phonology of lexical tone in infancy”, Lingua,
110(8), pp. 581–616.
Jusczyk, P. W., Cutler, A., and Redanz, N. (1993) “Preference for the predominant
stress patterns of English words”, Child Development, 64, pp. 675–687.
Kuhl, P. K. (1991) “Human adults and human infants show a ‘perceptual magnet
effect’ for the prototypes of speech categories, monkeys do not”, Perception and
Psychophysics, 50, pp. 93–107.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006)
“Infants show a facilitation effect for native language phonetic perception between
6 and 12 months”, Developmental Science, 9(2), pp. F13–F21.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., and Lindblom, B. (1992)
“Linguistic experience alters phonetic perception in infants by 6 months of age”,
Science, 255, pp. 606–608.
Liu, L., and Kager, R. (2014) “Perception of tones by infants learning a non-tone
language”, Cognition, 133(2), pp. 385–394.
Mattock, K., and Burnham, D. (2006) “Chinese and English infants’ tone percep-
tion: Evidence for perceptual reorganization”, Infancy, 10(3), pp. 241–265.
Mattock, K., Molnar, M., Polka, L., and Burnham, D. (2008) “The developmental
course of lexical tone perception in the first year of life”, Cognition, 106(3), pp.
Mattys, L., and Jusczyk, P. W. (2001) “Phonotactic cues for segmentation of fluent
speech by infants”, Cognition, 78, pp. 91–121.
F0 development in Cantonese
pre-adolescent children
Wai-Sum Lee
12.1 Introduction
There have been a number of frequency studies of F0 (pitch) development
in children. Kent (1976), in a survey of acoustic studies of children’s speech,
shows that there is an overall drop in F0 throughout the developmental course
from infancy to adulthood. For both males and females, F0 is at its highest of
about 400–500 Hz during the first year, and it decreases sharply to about 300
Hz over the first three years, which is followed by a gradual drop in F0 to about
250 Hz when reaching the onset of puberty at age 11 or 12. Between the two
genders, a significant difference in F0 emerges after age 11, and the difference
becomes more apparent after age 13, due to a further drop in F0 to about 100–
150 Hz for males during the period from 13 to 17 years of age. From infancy
to adulthood, males undergo an overall decrease in F0 of approximately two
octaves, but for females it is just over an octave. Kent points out that the gen-
eral pattern of the developmental course requires careful consideration, as the
amount of the data on F0 development from different age groups is limited.
Also, the F0 data presented in past studies are not always comparable to each
other, due to differences in test material, analysis method, and number of
Eguchi and Hirsh (1969) and Lee, Potamianos, and Narayanan (1999) are
two large-scale studies of the developmental change of speech in children
of a wide range of ages. Both analyze the acoustic properties of American
English vowels. In Eguchi and Hirsh (1969), the vowels are [i æ u ɛ a ɔ] from
84 subjects, including children aged from 3 to 13 and adults of both genders,
and in Lee, Potamianos, and Narayanan (1999), the vowels are [i ɪ ɛ æ ɑ ɔ ʌ ʊ
u ɝ] from 436 children and adolescents aged from 5 to 18 and 56 adults, with
males and females in each age group. Results of the two studies show that for
both genders, the F0 value decreases gradually with increasing age throughout
the pre-adolescent period before age 11. In Eguchi and Hirsh (1969), a sub-
stantial drop in F0 is observed in male children from 11 to 13, which is taken
to indicate the onset of the adolescent voice change. In contrast, the F0 drop
is gradual and small in female children from 11 to 13. Furthermore, the F0 in
male children at age 13 (221.1 Hz) is about an octave higher than that in male
adults (124.2 Hz), but the difference in F0 is small between female children
and female adults. A comparison of the F0 development data of Cantonese
and those of English as reported in the previous studies will also be made to
explore the language factor in F0 development.
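The octave comparisons cited above can be checked with a short calculation: the interval in octaves between two F0 values is the base-2 logarithm of their ratio. The function name below is hypothetical; the two F0 values are those quoted from Eguchi and Hirsh (1969) for male children at age 13 and male adults.

```python
import math


def octaves_between(f0_high_hz: float, f0_low_hz: float) -> float:
    """Interval between two F0 values in octaves (1.0 means a doubling)."""
    return math.log2(f0_high_hz / f0_low_hz)


# Male children at age 13 versus male adults (Eguchi and Hirsh 1969).
interval = octaves_between(221.1, 124.2)
print(f"{interval:.2f} octaves")
```

A result somewhat below 1.0 corresponds to the description "about an octave higher", since the ratio of the two values is a little less than 2.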
12.2 Method
12.2.1 Subjects
In this study, speech data were collected from a total of 100 native Cantonese
speakers, comprising 90 pre-adolescent kindergarten or primary school children
and 10 young adult university students. The children formed nine consecutive
age groups from 4 to 12 years, and in each age group there were five males and
five females. The ten adults, five males and five females, were in their early 20s.
The means and standard deviations of the ages of five children of the same
gender in each of the nine age groups are presented in Table 12.1. For male or
female children, the age difference between any two consecutive age groups is
one year plus/minus two months, thus ranging from 10 to 14 months. Between
male and female children of the same age group, the age difference is one
or two months. All the speakers were born in Hong Kong and grew up in
monolingual Cantonese-speaking families. None had a history of
speech or hearing problems. This study passed the ethical screening process
of the Research Committee at the City University of Hong Kong and received
prior parental consent for each child participant.
Table 12.1 Means (n = 5) and standard deviations (SD) of the ages of male and
female children of nine age groups from 4 to 12 years
Age group   Male: mean age   SD     Female: mean age   SD
4           4;6              1.87   4;5                2.74
5           5;8              2.49   5;6                1.52
6           6;6              1.41   6;7                0.89
7           7;6              2.49   7;5                2.07
8           8;8              2.17   8;7                0.89
9           9;5              2.59   9;7                1.10
10          10;5             0.89   10;5               0.71
11          11;4             0.89   11;5               1.00
12          12;4             1.34   12;4               2.28
12.3 Results
This section presents the mean or average F0 value for each of the Cantonese
level tones [55 33 22] uttered by the children of the nine age groups and adults,
male and female. The mean F0 values (see Table 12.2) are compared (i) among
children of the nine age groups from 4 to 12 years, (ii) between children of
each age group and adults of the same gender, and (iii) between the two
genders within each age group and across the age groups. Statistical analyses,
ANOVA and t-test, were performed to determine the significance level of the
between-group differences in F0.
[Figure 12.1 plots: upper panel, male speakers; lower panel, female speakers. Horizontal axis: age group (4 to 12, 20s); vertical axis: F0 (in Hz); curves labelled by tone.]
Figure 12.1 Developmental (mean) F0 change in the Cantonese tones [55 33 22] for
male (upper panel) and female (lower panel) children at 4 to 12 years of
age and adults in early 20s
Table 12.2 The mean F0 values (in Hz) of the Cantonese tones [55 33 22] for male
and female children at 4 to 12 years of age and adults in early 20s
Table 12.3 Ratios of the mean F0 values (in Hz) of the Cantonese tones [55 33 22]
for children at 4 to 12 years of age to those for adults in early 20s of the
same gender
to 1.1 for the three tones, indicating only a slight difference in F0 between the
two genders. At age 12, the F0 ratios of females to males increase to a range of
1.2 to 1.3 for the three tones, and the F0 differences between the two genders
are significant for the tones [55] (p < .001), [33] (p < .001), and [22] (p < .01).
The F0 data suggest the gender difference in children’s voice emerges at the
age of 12 years.
12.3.4 F0 differences among the three Cantonese level tones [55 33 22]
Similar patterns of the differences in F0 of the three Cantonese level tones [55
33 22] are observed between the children of any age group and adults, male
or female. For all the speakers, as expected, the F0 is highest for [55], followed
by [33] and [22] in decreasing order, and there is a larger F0 difference or tonal
space between [55] and [33] than between [33] and [22]. Table 12.5 presents the
F0 ratios of [55] to [33] and [33] to [22] for male and female children at the ages
of 4–12 years and for male and female adults. For children of the nine age
groups, the F0 ratios of [55] to [33] range from 1.10 to 1.16 for males and 1.10
to 1.19 for females, that is, the F0 is 10 percent to 16 percent (males) or 10 per-
cent to 19 percent (females) higher for [55] than [33]; the F0 ratios of [33] to [22]
range from 1.03 to 1.08 for males and 1.05 to 1.08 for females, that is, the F0
is 3 percent to 8 percent (males) or 5 percent to 8 percent (females) higher for
[33] than [22]. Thus, the F0 difference is about twice as large between [55] and [33]
than between [33] and [22], which corresponds to the tone space differences as
suggested in the tone letters [55 33 22]. A similar pattern of differences in F0 of
the three level tones is also observed for the adults, though the F0 ratios of [55]
to [33], that is, 1.19 for both male and female adults, and the F0 ratios of [33] to
[22], that is, 1.11 for male adults and 1.13 for female adults, are slightly higher
than the F0 ratios for children. The data indicate that children at four years of
age have acquired the adult-like pattern of tonal space in Cantonese. Thus, the
tonal spaces among the three Cantonese level tones [55 33 22] are maintained,
irrespective of the difference in absolute F0 values of the tones.

Table 12.5 F0 ratios of the Cantonese tones [55] to [33] and [33] to [22] for male and
female children at 4 to 12 years of age and adults in early 20s
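The conversion from an F0 ratio to the percentage difference quoted in this section, and to a musical interval in semitones, can be sketched as follows. The function names are hypothetical, and the semitone conversion is added here for illustration only; it is not a measure used in the chapter. The ratios are the adult values reported above.

```python
import math


def percent_higher(ratio: float) -> float:
    """Percentage by which the numerator F0 exceeds the denominator F0."""
    return (ratio - 1.0) * 100.0


def semitones(ratio: float) -> float:
    """Interval corresponding to an F0 ratio, in equal-tempered semitones."""
    return 12.0 * math.log2(ratio)


# Adult ratios reported in the text: [55]/[33] and [33]/[22].
adult_ratios = {
    "male [55]/[33]": 1.19,
    "male [33]/[22]": 1.11,
    "female [55]/[33]": 1.19,
    "female [33]/[22]": 1.13,
}

for label, ratio in adult_ratios.items():
    print(f"{label}: {percent_higher(ratio):.0f}% higher, {semitones(ratio):.2f} semitones")
```

Because a ratio is unchanged when both F0 values are scaled by the same factor, comparing ratios rather than differences in Hz is what allows tonal space to be compared across speakers with very different absolute F0.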
12.4 Discussion
The general patterns of the age-related and gender-related developmental F0
change in Cantonese-speaking children presented above are similar to those
in English-speaking children reported in the previous cross-sectional studies
of the F0 development, such as Eguchi and Hirsh (1969) and Lee, Potamianos,
and Narayanan (1999). To demonstrate the cross-language similarities and
differences in F0 development, the F0 data on the English vowel [ɑ] from pre-
adolescent children at 5–12 years of age and young adults of age 18 presented
in Lee, Potamianos, and Narayanan (1999) are compared with the F0 data
on the comparable Cantonese vowel [a] in the present study. As the youngest
English children in Lee, Potamianos, and Narayanan (1999) are five years
old, the F0 data from Cantonese children at age four are not included for
Figure 12.2 shows the superimposed developmental F0 curves for male
(upper panel) and female (lower panel) speakers of Cantonese and English.
The curves are plotted based on the mean F0 values for the English [ɑ] and
those for the three Cantonese tones [55 33 22] on the vowel [a] from children
of the same age and gender groups of the two languages. As shown in the
figure, the F0 values for English children are similar to the F0 values of the
high-level tone [55] for Cantonese children. This is true for male and female
children of the eight age groups, except the age group of the five-year-olds,
where the F0 value for English children is much closer to the F0 value of the
tone [33] than the tone [55] in Cantonese.
[Figure 12.2 plots: upper panel, male speakers; lower panel, female speakers. Horizontal axis: age group (5 to 12, 18/20s); vertical axis: F0 (in Hz).]
Figure 12.2 Developmental change in the (mean) F0 values (in Hz) of the English [ɑ]
(Lee, Potamianos, and Narayanan 1999) and the Cantonese [a] associated
with each one of the three tones [55 33 22] for male (upper panel) and
female (lower panel) children at 5 to 12 years of age and adults aged 18
or in early 20s
Table 12.6 F0 ratios of children to adults of the same gender for Cantonese and English
indicate the onset of voice change from early childhood to middle childhood.
Such a voice change is assumed to take place before age five for English chil-
dren. Another difference between Cantonese and English is the abrupt F0
drop from age 11 to age 12 for Cantonese male children, which is taken as an
indication of the onset of adolescent voice change. The drop in F0 between
age 11 and age 12 for English male children is smaller, suggesting
that the adolescent voice change in English children has yet to start. There is
a lack of a similar F0 drop for female children of both languages from 11 to
12, suggesting that the similar voice change in male children does not occur in
female children at ages 11 and 12.
The F0 values for children and those for adults of the same gender are
compared for Cantonese and English. Table 12.6 presents the F0 ratios of
children of each age group to adults of the same gender for both languages.
The Cantonese data are based on the mean F0 values for the tone [55] on the
vowel [a] and the English data on the mean F0 values for the vowel [ɑ] reported
in Lee, Potamianos, and Narayanan (1999). As presented in the table, for both
Cantonese and English, the difference in F0 between male adults and male
children of any age group is large, with the F0 ratio of children to adults close
to 2.0 in all but a single case. Thus, the F0 for male children is approximately
an octave higher than (about twice) that for male adults. A large drop in the F0 ratio
of male children at age 12 to male adults is observed for Cantonese (down
to 1.456) but not English (down to 1.879). The difference is assumed to be
due to the disparity in the onset time for the adolescent voice change in male
children between the two languages. For both languages, the F0 difference is
small between female children and female adults, with the ratios of children
to adults ranging from 1.0 to 1.2. At ages 11 and 12, for both Cantonese and
English, the F0 ratio of female children to female adults is close to 1.0,
indicating a similarity in voice between female children and female adults. Thus,
for both languages the developmental F0 change is smaller for female children
and ends earlier in female children than male children.

Table 12.7 F0 ratios of females to males of each age group for Cantonese and English

Age group   Cantonese   English
5           1.082       1.023
6           1.056       0.971
7           1.042       1.060
8           1.068       1.105
9           1.097       1.047
10          1.037       1.023
11          1.051       0.988
12          1.290       1.004
20s         1.911       1.952
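The female-to-male F0 ratios tabulated for the two languages can be scanned for the first child age group at which the ratio exceeds a chosen cutoff. The sketch below does this for the ratios at ages 5 to 12. The function name and the cutoff of 1.2 are assumptions made for illustration; the chapter itself does not define a threshold.

```python
# Female-to-male F0 ratios for children aged 5 to 12, by language.
ratios = {
    "Cantonese": {5: 1.082, 6: 1.056, 7: 1.042, 8: 1.068,
                  9: 1.097, 10: 1.037, 11: 1.051, 12: 1.290},
    "English": {5: 1.023, 6: 0.971, 7: 1.060, 8: 1.105,
                9: 1.047, 10: 1.023, 11: 0.988, 12: 1.004},
}


def first_age_above(by_age: dict[int, float], cutoff: float):
    """Return the youngest age whose ratio exceeds the cutoff, or None."""
    for age in sorted(by_age):
        if by_age[age] > cutoff:
            return age
    return None


CUTOFF = 1.2  # hypothetical threshold for a marked gender difference

for language, by_age in ratios.items():
    print(language, first_age_above(by_age, CUTOFF))
```

With this cutoff, only the Cantonese series crosses the threshold within the age range covered, at age 12.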
The gender-related F0 changes between the two languages are also
compared, based on the mean F0 values for the Cantonese tone [55] on the
vowel [a] in the present study and the F0 values of the English vowel [ɑ] from
Lee, Potamianos, and Narayanan (1999). Table 12.7 presents the F0 ratios of
females to males of each age group for the two languages. As presented in the
table, the F0 ratios of female children to male children are just about 1.0–1.1
in most cases for Cantonese and English. This indicates that the gender diffe-
rence in F0 in pre-adolescent children of the two languages is minimal, though
the F0 value tends to be slightly larger for female than male children of the
same age group. A noticeable difference is the marked increase in the female-
to-male ratio at age 12 for Cantonese (1.290) but not English (1.004). The
data indicate that the emergence of a gender distinction in F0 or pitch occurs
earlier in Cantonese children than English children. Between male and female
adults, as expected, there is a large gender difference in F0 for both Cantonese,
with the females to males F0 ratio of 1.911, and English, with the females to
males F0 ratio of 1.952. This indicates that the F0 for female adults is close to
twice the F0 for male adults, whether the speakers are of a tone or non-tone
12.5 Conclusion
This chapter has presented the age-related and gender-related developmental
patterns of the F0 values of the Cantonese level tones [55 33 22] in pre-
adolescent children, male and female. The significant findings of this study
are summarized as follows. First, the large F0 drop for all three Cantonese
tones between ages five and six for children of both genders suggests the onset
of physical maturation, including laryngeal growth and the lengthening of the
vocal folds. Second, the large drop in F0 in Cantonese male children between
ages 11 and 12 marks the onset of the adolescent voice change. Third, the
relative differences in F0 value between the tones [55] and [33] and between the
tones [33] and [22] are maintained in children, male and female, at different
ages and in male and female adults, showing that for the tones to be linguis-
tically distinguishable, the tonal spaces are maintained despite the changed
absolute F0 values. Fourth, the similarity in the F0 developmental pattern
between Cantonese and English pre-adolescent children (i) shows that the
difference in the prosodic system, with Cantonese being a tone language and
English a stress language, does not appear to affect F0 developmental
change, and (ii) may suggest a possible universal of F0 development
in children.
Baker, S., Weinrich, B., Bevington, M., Schroth, K., and Schroeder, E. (2008) “The
effect of task type on fundamental frequency in children”, International Journal of
Pediatric Otorhinolaryngology, 72(6), pp. 885–889.
Bennett, S. (1983) “A 3-year longitudinal study of school-aged children’s fundamental
frequencies”, Journal of Speech and Hearing Research, 26(1), pp. 137–141.
Bennett, S., and Weinberg, B. (1979a) “Sexual characteristics of preadolescent
children’s voices”, Journal of the Acoustical Society of America, 65(1), pp. 179–189.
Bennett, S., and Weinberg, B. (1979b) “Acoustic correlates of perceived sexual identity
in preadolescent children’s voices”, Journal of the Acoustical Society of America,
66(4), pp. 989–1000.
Busby, P. A., and Plant, G. L. (1995) “Formant frequency values of vowels produced
by preadolescent boys and girls”, Journal of the Acoustical Society of America,
97(4), pp. 2603–2606.
Curry, E. T. (1940) “The pitch characteristics of the adolescent male voice”, Speech
Monographs, 7(1), pp. 48–62.
Eguchi, S., and Hirsh, I. J. (1969) “Development of speech sounds in children”, Acta
Oto-laryngologica, Supplementum, 257, pp. 1–51.
Fairbanks, G. (1950) “An acoustical comparison of vocal pitch in seven- and eight-
year-old children”, Child Development, 21(2), pp. 121–129.
Fairbanks, G., Herbert, E. L., and Hammond, J. M. (1949) “An acoustical study of
vocal pitch in seven- and eight-year-old girls”, Child Development, 20(2), pp. 71–78.
Fairbanks, G., Wiley, J. H., and Lassman, F. M. (1949) “An acoustical study of vocal
pitch in seven- and eight-year-old boys”, Child Development, 20(2), pp. 63–69.
Hasek, C. S., Singh, S., and Murry, T. (1980) “Acoustic attributes of preadolescent
voices”, Journal of the Acoustical Society of America, 68(5), pp. 1262–1265.
Hollien, H., Green, R., and Massey, K. (1994) “Longitudinal research on adolescent
voice change in males”, Journal of the Acoustical Society of America, 96(5), pp.
Kent, R. D. (1976) “Anatomical and neuromuscular maturation of the speech mech-
anism: Evidence from acoustic studies”, Journal of Speech and Hearing Research,
19(3), pp. 421–447.
13.1 Introduction
Research on second language (L2) acquisition in the past few decades has
shown that, although one’s first language (L1) has an effect on second lan-
guage sound patterns (i.e., “L1 transfer”), interlanguage grammars are also
restricted by some universal phonetic or phonological principles (Broselow
et al. 1998; Major 2001; Eckman 2004, among others). Studies on the L2
acquisition of Modern Standard Chinese (hereafter, Chinese) tones have
uncovered some patterns that appear to be derived neither from learners’ native
language nor from the target language. Rather, such error patterns often
reveal universally preferred structures (H. Zhang 2010, 2016). However, it is
still unclear exactly how universals can be accessible to adult L2 learners and
the extent to which specific phonetic mechanisms or phonological principles
shape the interlanguage grammar. The current study on L2 Chinese contour
tones attempts to 1) show how the position within a word affects production
accuracy, and 2) demonstrate a possible correlation between L2 tones and a
cross-linguistically common phonetic mechanism of tonal coarticulation, and
to inspire further research on this topic.
Contour tones are regarded as more marked (complex) than level tones
(Ohala 1978; J. Zhang 2002). Contour tones also show a higher degree of
vulnerability than level tones in L2 studies (H. Zhang 2010, 2016), and this
motivates the further exploration of contour tones in the current study, in
particular positional effects in L2 Chinese. H. Zhang (2015) investigates
how the position of a tone within a clause affects the production of L2
tones by examining the performance of all four Chinese tones at initial and
final positions of various prosodic units within sentences (referred to here as
“sentential” positional effects). However, the research on intertonal effects
of L2 tones, that is, how the tones affect one another in connected speech
thereby decreasing the intelligibility of non-native tones, remains under-
developed. Of particular interest in this study are anticipatory coarticu-
lation and carry-over coarticulation. Anticipatory coarticulation involves
one speech sound being influenced by subsequent sounds and carry-over
coarticulation involves one speech sound being influenced by preceding
13.3 Methods
In order to survey the performance of L2 contour tones in disyllabic words,
a phonological study consisting of a pre-test and a main experiment was
designed. The pre-test, a reading task of 48 monosyllabic morphemes, was
used to ensure that all participants were able to produce individual lexical
tones correctly.
13.3.1 Participants
Sixty-seven learners participated in the pre-test. Seven were excluded from
participation in the main experiment due to low accuracy rates in the pre-
test (below 85 percent). Among the 60 participants in the main experiment,
20 were native English speakers (12 males and 8 females), 20 were native
Japanese speakers (10 males and 10 females) from areas with Tokyo-type
pitch accent, and 20 were native Korean speakers from Seoul (8 males and
12 females). All participants had been learning Chinese for at least 6 months,
but no more than 18 months at the time of data collection, placing them at
approximately an intermediate level. The learners were recruited from a US
university on the East Coast and a university in China. All participants
claimed that Chinese was the only tonal language they had studied. Learners
participated in this study voluntarily, and each was paid for the recording
13.3.2 Procedure
For the main experiment, participants were given a list of sentences and were
asked to produce each in Chinese at a normal speed. Stimuli for the main
experiment were disyllabic words bearing all 16 possible combinations of the
four lexical tones. Although this study pays particular attention to T2 and T4,
it also includes the remaining lexical tones in order to obtain a broader per-
spective of tone performance, as well as for comparative purposes. Each of
the 16 possible tone combinations was presented in equal proportions. Two
words (consisting of different morphemes) for each tone combination type
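As a rough illustration of the stimulus design described above, the following Python sketch enumerates the 16 ordered combinations of the four lexical tones in a disyllabic word and picks out those carrying a given tone in word-initial or word-final position. The names `TONES`, `tone_combinations`, and `with_tone_at` are hypothetical helpers introduced here for illustration; they are not part of the study's materials.

```python
from itertools import product

# The four lexical tones of Modern Standard Chinese, labeled T1-T4.
TONES = (1, 2, 3, 4)

def tone_combinations():
    """Return all ordered (initial, final) tone pairs for disyllabic words."""
    return list(product(TONES, repeat=2))

def with_tone_at(combos, tone, position):
    """Keep pairs carrying `tone` at `position` (0 = word-initial, 1 = word-final)."""
    return [pair for pair in combos if pair[position] == tone]

combos = tone_combinations()
print(len(combos))                 # number of tone combinations
print(with_tone_at(combos, 2, 0))  # combinations with word-initial T2
print(with_tone_at(combos, 4, 1))  # combinations with word-final T4
```

Filtering by position in this way mirrors the word-initial versus word-final comparison that the analysis of T2 and T4 relies on.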
The test words were embedded in sentences, where they served as
modifiers of nouns. In order to avoid anticipatory
and carry-over effects from neighboring tones (Xu 1997), the tokens were
embedded in sentences where the preceding and following morphemes were
both the neutral-toned particle de. This way, any effect from the neutral tone
would be the same for all test tokens. In addition, these test words were placed
in a sentence-medial position to reduce the possible interference of sentence
intonation. Example (5) displays the carrier sentence structure:
13.3.3 Analysis
The correctness of L2 tonal production was judged within sentences.
The author, a native speaker of Chinese, judged whether or not the tonal
productions were acceptable by both listening to productions and measuring
pitches in Praat. Productions were marked as “correct” or “incorrect”. A tone
13.4 Results
This section reports the results obtained from the experiment with Section
13.4.1 focusing on general error patterns of T2 and T4 in disyllabic words and
Section 13.4.2 on intertonal effects, with particular attention paid to anticipa-
tory effects. Although individual differences are found in L2 tone productions,
I chose to look at these learners as a group when examining the results, in
order to posit generalizations concerning the positional effects in L2 tones.
following columns display the error rate information of the overall dataset
and that of the individual language groups, with the percentages of errors
in word-initial positions at left followed by errors in word-final positions.
Significant differences (p < 0.05) between the error rates in different positions
are highlighted in bold and are in shaded cells.
The most striking positional effects are found in the productions of T2
and T4. Word-initial T2 has a much lower error rate than word-final
T2, and the error rate of word-final T4 is significantly lower than that of word-
initial T4. This finding holds true across all three groups of speakers, and is
confirmed when examining the types of substitute tones participants used
when they made errors. In the study of L2 acquisition, substitution is not an
arbitrary process but instead stems from a process of avoidance and choice
by L2 learners. In the current dataset, it is found that 1) T4 is used more
often than T2 as a substitute tone for other tones when errors occurred;
and 2) T4 is substituted significantly more often for other tones at word-
final positions, while T2 is usually substituted for other tones at word-initial
positions. Table 13.3 lists the substitution patterns with positional infor-
mation. The first row contains the names of the language groups and the
total number of substitute tones in each group. The first column lists the
substitute tones, T2 and T4, employed by L2 learners when the target tones
were incorrectly produced. Each cell lists the count of the substitute tone
followed by its percentage. The percentages under each L1 are
calculated out of all substitute tones, that is, the total number of errors,
in that L1 data set.
Findings indicate that positional errors are negatively correlated with pos-
itional substitution rates in most of the L2 productions. Where the error rates
for a tone are high, the use of that tone as a substitute for other tones is
low, and vice versa. Across all three groups of learners, the rising tone T2 is
“disfavored” in word-final positions, and the falling tone T4 is “disfavored”
Table 13.2 Error patterns with positional information
[Figure residue: two panels plotting percentages (0–50%) by disyllabic tone combination, one for combinations containing T2 and one for combinations containing T4, each with series for English, Japanese, and Korean speakers. The plotted values are not recoverable.]
English Speakers Japanese Speakers Korean Speakers
Response tones T2-T1 (25%) T2-T1 (53%) T2-T1 (33%) T2-T4 (43%) T2-T4 (61%) T2-T4 (51%)
T3-T1 (19%) T3-T1 (15%) T3-T4 (23%) T3-T4 (18%) T3-T4 (20%) T3-T4 (26%)
T1-T1 (13%) T1-T1 (15%) T3-T1 (20%) T2-T3 (18%) T1-T4 (6%) T3-T1 (8%)
and incorrect productions) and the percent at which they were produced out
of all response tones for target T2-T1 and T2-T4 sequences (i.e., word-initial
T2 followed by tones with high onsets) and target T4-T1 and T4-T4 (i.e.,
word-initial T4 followed by tones with high onsets). All the learners produced
erroneous T3-T1, T3-T4, T3-T1, and T3-T4 sequences more often than other
errors when they were intending to produce T2-T1, T2-T4, T4-T1, and T4-T4
sequences, respectively. T3 errors are boldfaced when they are widely used
as substitutions for the target T2 and T4 at word-initial positions.
A likelihood ratio test was used to determine whether T3 is the most fre-
quently produced error for target T2 and T4 at word-initial positions, when
followed by T1 or T4 (with the word-final tones correct). For example, if
the target tone sequence is T2-T4, the number of times T3-T4 was actu-
ally produced was compared with the number of times T1-T4 or T4-T4 was
produced. Table 13.7 displays the results of statistical analyses. Clearly, when
all language groups are pooled together (the ‘All’ column in Table 13.7), T3
is produced for target word-initial T2 and T4 significantly more often than
other erroneous tones. The data support the hypothesis that L2 contour tones
are constrained by the mechanism of anticipatory coarticulation.
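The comparison described above can be sketched as a two-category likelihood ratio (G) test. This is a minimal sketch under stated assumptions, not the chapter's actual procedure: the null hypothesis of equal frequencies for T3 and non-T3 errors is assumed here because the text does not specify one, the function name `g_test_two_way` is hypothetical, and the counts in the usage line are invented for illustration and are not data from the study.

```python
import math

def g_test_two_way(t3_count, other_count):
    """Likelihood-ratio (G) test of two counts against equal expected frequencies.

    Returns (G, p). With two categories there is one degree of freedom, so the
    chi-square tail probability is erfc(sqrt(G / 2)).
    """
    total = t3_count + other_count
    if total == 0:
        raise ValueError("at least one error token is required")
    expected = total / 2  # assumed null: T3 and non-T3 errors equally likely
    g = 0.0
    for observed in (t3_count, other_count):
        if observed > 0:  # a zero count contributes nothing to the sum
            g += 2 * observed * math.log(observed / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    p = math.erfc(math.sqrt(g / 2))
    return g, p

# Hypothetical counts, NOT data from the study: 30 T3 errors vs. 10 other errors.
g, p = g_test_two_way(t3_count=30, other_count=10)
print(f"G = {g:.2f}, p = {p:.4f}")
```

A p-value below 0.05 under this assumed null would correspond to T3 being produced significantly more often than the other erroneous tones combined.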
All three language groups of L2 learners of Chinese erroneously produced
T3 most frequently when the target tone was a word-initial T2 or T4 followed
by tones with high onsets. As was the case for the previous two sections, this
was seen most clearly for T2, with results from T4 being less consistent.
13.5 Discussion
[Figure 13.4: schematic tonal sequences for T2 (cases (a) and (b)) and T4 (cases (c) and (d)); the diagram is not recoverable.]
anticipatory dissimilation on T2s (cases (a) and (b)) and T4s (cases (c) and
(d)), is shown in Figure 13.4.
In cases (a) and (c), the H component is pushed even higher due to the L
component tones in the following syllable. This is a “pre-low raising” effect.
This kind of “enhancement” of H in target contour tones usually does not
influence the preservation of target tone identities. However, in cases (b) and
(d), the H component tones in T2 and T4 are lowered, triggered by the H
in the following syllable. This decreases the intelligibility of the L2 word-
initial T2 and T4 tones for native Chinese listeners, resulting in more errors
in L2 tones.
If we consider all of the cases in Figure 13.4 to be dissimilatory in nature,
the T2 patterns (i.e., (a) and (b)) indicate an immediate dissimilation since the
changes occur on contiguous tones. However, the T4 patterns (i.e., (c) and
(d)) are examples of distance dissimilation since the onset of the final syllable
triggers the contour changes of the beginning H component of the initial
syllable but not the immediate L component. Interestingly, more evidence for
(a) and (b) was found than for (c) and (d) in the present L2 study. That is,
while anticipatory coarticulation affects both T2 and T4 to similar degrees
in native Chinese (Xu 1997), in L2 tones it seems to affect T2 more easily
than T4. This is surprising given that T2 and T4 are equally grammatical and
are freely distributed in native Standard Chinese. In addition, there is no sig-
nificant difference between the frequency of T2 and T4 in the basic vocabu-
lary of Standard Chinese (Shang 2000). Since L2 learners do not have any
preexisting knowledge of lexical tones in their L1s, T2 and T4 are equally dif-
ficult for them. The asymmetry of T2 and T4 found in the present study thus
invites an analysis from the perspective of universal phonetic or phonological
constraints, such as the Tonal Markedness Scale.
The poorer performance and later acquisition of rising tones (T2) than
falling tones (T4) by both L1 and L2 learners have been observed in previous
studies (Li and Thompson 1977; Zhu and Dodd 2000; H. Zhang 2010, 2013).
However, in L2 tones the failure to realize target tone offsets of the first syl-
lable by L2 learners may lead to higher error rates, resulting in more notice-
able anticipatory effects compared to carry-over effects.
13.6 Conclusion
This study surveys the error patterns of non-native T2 and T4 in disyllabic
words made by 60 learners of Chinese with different L1 backgrounds. It is
found that T2 is produced with a higher rate of accuracy at word-initial
positions than in word-final positions, while T4 is produced with a higher rate
of accuracy at word-final positions than in word-initial positions. However,
some interesting intertonal effects that run contrary to this general finding
were also observed. T4 is produced noticeably more accurately when it
is followed by T3 than in other contexts. The accuracy rates of T2 when followed by T3 and when
followed by another T2 are always higher than when T2 is followed by T1
or T4. By further examining the phonetic variation of correct productions
of T2 and T4 at correspondent positions and the error types for incorrect
productions, this study argues that the error grammar of L2 contour tones is
1 This chapter is developed from a pilot study which was presented at the First
International Conference on Prosodic Studies: Challenges and Prospects (ICPS-1),
13–14 June 2015, at Tianjin Normal University, China.
2 The Tone Bearing Unit (TBU) in English and Korean is usually assumed to be a
syllable, but the TBU in Japanese is the mora (Gussenhoven 2004; Venditti 2005).
3 The underlying form of Tone 3 is under debate. Traditionally, Tone 3 is taken to be
underlyingly a low dipping tone (tone value [214]) that is only realized in isolated
syllables and in prosodic-final positions. The other T3 allotone, a low-level tone
(tone value [21] or [11]), has a much wider distribution. Mei (1977), Yip (1980,
2002), and other studies argue that the underlying form of Tone 3 is [21], and Zhang
(2014) proposes that [214] is the intonation form of Tone 3.
4 A striking exception is the substantial anticipatory coarticulation in Kinyarwanda
(Myers, 2003).
Bao, Z.-M. (1999) “Tonal contour and register harmony in Chaozhou”, Linguistic
Inquiry, 30(3), pp. 485–493.
Bent, T. (2005) Perception and production of non-native prosodic categories.
Unpublished doctoral diss., Northwestern University.
Boersma, P., and Weenink, D. (2011) Praat: Doing phonetics by computer [Computer
program]. Version 5.2.17. Retrieved November 2011,
Broselow, E., Chen, S., and Wang, C. (1998) “The emergence of the unmarked in
second language phonology”, Studies in Second Language Acquisition, 20(2),
pp. 261–280.
Broselow, E., Hurtig, R., and Ringen, C. (1987) “The perception of second lan-
guage prosody” in Ioup, G., and Weinberger, S. (eds.) Interlanguage phonology.
Cambridge, MA: Newbury House Publishers, pp. 350–362.
Brunelle, M. (2003) Coarticulation effects in Northern Vietnamese Tones. MS thesis,
University of Ottawa.
Chang, C. B., and Yao, Y. (2016) “Toward an understanding of heritage prosody:
Acoustic and perceptual properties of tone produced by heritage, native, and second
language speakers of Mandarin”, Heritage Language Journal, 13(2), pp. 134–160.