Prosodic Studies - Challenges and Prospects



Prosodic Studies

Prosody is one of the core components of language and speech, indicating
information about syntax, turn-taking in conversation, types of utterances,
such as questions or statements, as well as speakers’ attitudes and feelings.
This edited volume presents studies in prosody on Asian languages as well as
examples from other languages. It brings together the most recent research
in the field and also charts its influence on such diverse fields as multimedia
communication and SLA.
Intended for a wide audience of linguists that includes neighboring
disciplines such as computational sciences, psycholinguists, and specialists in
language acquisition, Prosodic Studies is also ideal for scholars and researchers
working in intonation who want a complement of information on specifics.

Hongming Zhang is Professor and Head of the Chinese Language &
Linguistics Program at the University of Wisconsin-Madison. He is also the
executive editor of International Journal of Chinese Linguistics, series editor of
Routledge Studies in Chinese Linguistics, and editor of the volume Phonology
and Poetic Prosody of The Encyclopedia of China (3rd edition). His recently
published books include Syntax-Phonology Interface: Argumentation from
Tone Sandhi in Chinese Dialects and Tonal Prosody in Yongming Style Poems.

Youyong Qian is Associate Professor at the Institute of Linguistics, Chinese
Academy of Social Sciences. He received his PhD in Chinese Linguistics from
the University of Wisconsin-Madison in 2015 and MA in Chinese Linguistics
from Hanyang University, Seoul, South Korea, in 2010. His research interests
include theoretical linguistics, phonology, Chinese historical phonology,
and language acquisition. His major publication is A Study of Sino-Korean
Phonology: Its Origin, Adaptation and Layers (2018, Routledge).

Routledge Studies in Chinese Linguistics

Series editor: Hongming Zhang

A Study of Sino-Korean Phonology
Its Origin, Adaptation and Layers
Youyong Qian

Partition and Quantity
Numerical Classifiers, Measurement and Partitive Constructions in
Mandarin Chinese
Jing Jin

Mandarin Loanwords
Tae Eun Kim

Intensification and Modal Necessity in Mandarin Chinese
Jiun-Shiung Wu

The Architecture of Periphery in Chinese
Cartography and Minimalism
Victor Pan

Focus Manifestation in Mandarin Chinese and Cantonese:
A Comparative Perspective
Peppina Po-lun Lee

Prominence and Locality in Grammar
The Syntax and Semantics of Wh-Questions and Reflexives
Jianhua Hu

Prosodic Studies
Challenges and Prospects
Edited by Hongming Zhang and Youyong Qian

For more information about this series, please visit:​


Prosodic Studies
Challenges and Prospects

Edited by Hongming Zhang and Youyong Qian

First published 2020
by Routledge
2 Park Square, Milton Park, Abingdon, Oxon OX14 4RN
and by Routledge
52 Vanderbilt Avenue, New York, NY 10017
Routledge is an imprint of the Taylor & Francis Group, an informa business
© 2020 selection and editorial matter, Hongming Zhang and Youyong Qian;
individual chapters, the contributors
The right of Hongming Zhang and Youyong Qian to be identified as the authors of the
editorial material, and of the authors for their individual chapters, has been asserted in
accordance with sections 77 and 78 of the Copyright, Designs and Patents Act 1988.
All rights reserved. No part of this book may be reprinted or reproduced or utilised
in any form or by any electronic, mechanical, or other means, now known or
hereafter invented, including photocopying and recording, or in any information
storage or retrieval system, without permission in writing from the publishers.
Trademark notice: Product or corporate names may be trademarks or registered trademarks,
and are used only for identification and explanation without intent to infringe.
British Library Cataloguing-​in-​Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-​in-​Publication Data
Names: Zhang, Hongming (College teacher), editor. | Qian, Youyong, editor.
Title: Prosodic studies : challenges and prospects /
edited by Hongming Zhang and Youyong Qian.
Description: Abingdon, Oxon ; New York : Routledge, 2019. |
Series: Routledge Studies in Chinese Linguistics |
Includes bibliographic references and index.
Identifiers: LCCN 2019014171 (print) | LCCN 2019980031 (ebook) |
ISBN 9780815380580 (hardcover) | ISBN 9781351212878 (ebook) |
ISBN 9781351212861 (pdf) | ISBN 9781351212847 (mobi) |
ISBN 9781351212854 (epub)
Subjects: LCSH: Prosodic analysis (Linguistics) | Grammar, Comparative and
general–Phonology. | Grammar, Comparative and general–Syntax. |
Grammar, Comparative and general–Morphology. | Language acquisition.
Classification: LCC P224 .P738 2019 (print) |
LCC P224 (ebook) | DDC 414/.6–dc23
LC record available at
LC ebook record available at
ISBN: 978-​0-​8153-​8058-​0 (hbk)
ISBN: 978-​1-​351-​21287-​8 (ebk)
Typeset in Times New Roman
by Newgen Publishing UK


Contents

List of figures
List of tables
List of contributors

Introduction

Prosodic hierarchy
1 Life after the Strict Layer Hypothesis: Prosodic structure geometry
2 The Revised Max Onset: Syllabification and stress in English
3 Enclitics and the clitic group consisting of “host+enclitic” in the Fuzhou dialect

Prosodic patterns
4 Geographical clines in the realization of intonation in the Netherlands
5 A prosodic essence conjecture
6 Phonological representations based on statistical modeling in tonal languages
7 Prosodic encoding of contrastive focus in Shanghai Chinese

Interface between prosody and syntax/morphology
8 What kinds of processes are postlexical? And how powerful are they?
9 Match Theory and prosodic well-formedness constraints
10 Prosodic studies of two Chinese dialects

Prosody in language acquisition
11 Perceptual development of phonetic categories in early infancy: Consonants, vowels, and lexical tones
12 F0 development in Cantonese pre-adolescent children
13 The positional effects of contour tones in second language Chinese

Language index
Subject index


List of figures

1.1 Prosodic hierarchy (Nespor and Vogel 1986/2007)
1.2 Composite Prosody Model with tripartite prosodic hierarchy
1.3 Transition between upper and lower interface constituents in prosodic hierarchy
3.1 Prosodic hierarchies (Zhang 1992, 2017)
4.1 Recording locations in the Netherlands
4.2 Mean sonorant rime duration in non-final falls, final falls, and final fall-rises for each variety
4.3 Mean proportional peak timing in non-final falls, final falls, and final fall-rises for each variety
4.4 Mean scaling in semitones of H and L in nf-FALLS (left-hand panel) and f-FALLS (right-hand panel) for each variety
4.5 Mean scaling in semitones of H, L, and H2 in f-FR for each variety
4.6 Mean f0 duration in ms (left panel), f0 excursion in ST (center panel), and f0 slope in ST/s (right panel) for non-final and final falls, broken down by dialect
4.7 Mean f0 duration in ms (left panel), f0 excursion in ST (center panel), and slope in ST/s (right panel) of the falling (FR1) and rising (FR2) movements of final fall-rises
4.8 Duration ratio, excursion ratio, and slope ratio between the falling and rising movement of f-FR
4.9 Schematic representation of fall-rise types in peripheral (Zuid-Beveland and Winschoten) and central varieties
6.1 Proportion of glottalization on Tones 7 and 8 by 30 speakers
6.2 Duration values of eight tones
6.3 The plot for the result of the Dunnett’s test (control group: Tone 7)
6.4 The plot for the result of the Dunnett’s test (control group: Tone 8)
6.5 The Pareto chart of the main effects (reaction time)
6.6 The half-normal plot of the effects (reaction time)
6.7 The Pareto chart of the main effects (A’ value)
6.8 The half-normal plot of the effects (A’ value)
6.9 Tone 1
6.10 Tone 2
6.11 Tone 3
6.12 Tone 4
6.13 Tone 5
6.14 Tone 6
6.15 Tone 7
6.16 Tone 8
7.1 The time-normalized f0 contours of the four target syllables within the four sentence types, uttered in non-focused condition
7.2 The time-normalized f0 contours of the four stimulus sentences (Types 1–4), uttered in non-focused condition (N-F: red) and focused condition with contrastive focus on S1 (F-S1: dark green), on S2 (F-S2: green), on S3 (F-S3: blue), and on S4 (F-S4: purple)
7.3 Box plots of the rhyme duration (left) and mean intensity (right) of each target syllable
11.1 (a) Pitch trajectories of example stimuli of Tone 2 and Tone 3. (b) Pitch trajectories of example stimuli of Tone 1 and Tone 4
11.2 Results of both younger and older Mandarin-learning infants for the Tone 2–Tone 3 (left two columns) and for Tone 1–Tone 4 (right two columns) contrasts
12.1 Developmental (mean) F0 change in the Cantonese tones [55 33 22] for male (upper panel) and female (lower panel) children at 4 to 12 years of age and adults in early 20s
12.2 Developmental change in the (mean) F0 values (in Hz) of the English [ɑ] and the Cantonese [a] associated with each one of the three tones [55 33 22] for male (upper panel) and female (lower panel) children at 5 to 12 years of age and adults aged 18 or in early 20s
13.1 General error rates
13.2 Accuracy rates of T2 in various tone sequences
13.3 Accuracy rates of T4 in disyllabic words
13.4 The effect of anticipatory dissimilation on T2 and T4


List of tables

3.1 Phonological rules and different constructions in the Fuzhou dialect
4.1 Dutch context sentences and experimental sentences used to elicit non-final falls, final falls, and final fall-rises, with English translations
4.2 Number of speakers used in the analyses, broken down by variety, sentence condition, and gender
4.3 Overview of acoustic measurement labels
4.4 Acoustic variables used in the comparison of non-final and final nuclear contours in five varieties
4.5 Effect of Dialect, Gender, and Sentence_condition on RimeDuration
4.6 Effect of Dialect and Gender on RimeDuration in non-final falls, final falls, and final fall-rises
4.7 Pairwise comparisons for RimeDuration between levels of Dialect, separately for each sentence condition
4.8 Effect of Dialect, Gender, and Sentence_condition on H_RelTiming
4.9 Effects of Dialect on the scaling of the nuclear peak, the elbow, and the final high target in final fall-rises
4.10 Effects of Dialect on FallDuration, FallExcursion, and FallSlope in non-final falls and final falls
6.1 Eight tones in Chongming Chinese
6.2 Proportion of glottalization accompanying Tone 7 and Tone 8
6.3 The duration time for each tone
6.4 The results of the Dunnett’s test (Tone 7 as a control group)
6.5 The results of the Dunnett’s test (Tone 8 as a control group)
6.6 Pairs of allotones to be compared
6.7 An example of the fractional factorial design with three variables
6.8 Reaction time for each group
6.9 ANOVA table of reaction time for each group
6.10 A’ value for each group
6.11 ANOVA table of A’ value for each group
6.12 A t-test for accuracy rate and reaction time concerning glottalization (Allotone 7)
6.13 A t-test for accuracy rate and reaction time concerning glottalization (Allotone 8)
6.14 A t-test for accuracy rate and reaction time concerning duration (Allotone 7)
6.15 A t-test for accuracy rate and reaction time concerning duration (Allotone 8)
6.16 Models chosen for each tone and estimated coefficients
6.17 Quantiles based on fitted models and transformation to tone letters
6.18 Phonological representations based on acoustic values
7.1 The value of citation tones and sandhi tones in SHC
7.2 Stimulus sentences
7.3 An example of discourse contexts
7.4 The description of the tonal realization of S1+S2 compound and S3+S4 phrase
7.5 The effects of contrastive focus on the maxf0 and minf0 of each syllable
7.6 The effects of contrastive focus on the rhyme duration and mean intensity of each syllable
12.1 Means (n = 5) and standard deviations (SD) of the ages of male and female children of nine age groups from 4 to 12 years
12.2 The mean F0 values (in Hz) of the Cantonese tones [55 33 22] for male and female children at 4 to 12 years of age and adults in early 20s
12.3 Ratios of the mean F0 values (in Hz) of the Cantonese tones [55 33 22] for children at 4 to 12 years of age to those for adults in early 20s of the same gender
12.4 F0 ratios of females to males of each of the age groups, 4–12 and early 20s, for the Cantonese tones [55 33 22]
12.5 F0 ratios of the Cantonese tones [55] to [33] and [33] to [22] for male and female children at 4 to 12 years of age and adults in early 20s
12.6 F0 ratios of children to adults of the same gender for Cantonese and English
12.7 F0 ratios of females to males of the same age group for Cantonese and English
13.1 Potential influence of anticipatory coarticulation on T2 and T4 accuracy rates
13.2 Error patterns with positional information
13.3 Substitutions with positional information
13.4 Average F0 values of T2 offsets in correct productions
13.5 Average F0 values of T4 onsets in correct productions
13.6 The top three disyllabic response tones for target T2 (LH) at initial positions
13.7 Statistical analyses of error type comparisons for T2-T1, T2-T4, T4-T1, and T4-T4


List of contributors

Si Chen received her PhD in linguistics and her MS in statistics from the
University of Florida in 2014. She works on statistical modeling of speech
production, perception and their relationship, as well as applications in
speech training and speech therapy. She has developed statistical models
to solve challenging problems in phonology and simulate the human
perception process in extracting linguistic information from varied
speech signals of tones. Her publications include Chen, Si, Caicai Zhang,
Adam McCollum and Ratree Wayland (2017) “Statistical Modeling of
Phonetic and Phonologised Perturbation Effects in Tonal and Non-Tonal
Languages,” Speech Communication, 88, pp. 17–38.
San Duanmu is Professor of Linguistics, University of Michigan. He received
his PhD in Linguistics from the Massachusetts Institute of Technology in
1990 and has held teaching posts at Fudan University, Shanghai (1981–​
1986) and the University of Michigan, Ann Arbor (1991–​present). His
research focuses on general properties of language, especially those in
Jun Gao is Associate Professor of the Institute of Linguistics, Chinese
Academy of Social Sciences. Her research interests are phonological devel-
opment, infant speech perception, and children’s speech production. She
has published Shi, R., Gao, J., Achim, A., and Li, A. (2017) “Perception
and representation of lexical tones in native Mandarin-​learning infants
and toddlers,” Frontiers in Psychology, 8, p. 1117.
Carlos Gussenhoven is Professor Emeritus of General and Experimental
Phonology at Radboud University (Nijmegen, the Netherlands). He has
analyzed the phonologies of a number of languages, with a special orien-
tation on prosody. He has published The phonology of tone and intonation
(2004, Cambridge University Press) and coauthored Understanding phon-
ology (1998, 4th edition 2017, Routledge).
Judith Hanssen obtained her PhD from Radboud University (Nijmegen,
the Netherlands) in 2017. She specializes in phonetic and phonological
variation in (dialect) intonation, which resulted in a dissertation entitled
Regional variation in the realization of intonation contours in the Netherlands.
She is currently a lecturer in research methodology and English at Avans
University of Applied Sciences in the Netherlands.
Junko Ito is Professor of Linguistics at the University of California, Santa
Cruz. Her current research interests are the prosodic hierarchy, Optimality
Theory, and syllable and foot structure. Her publications include Ito, Junko
(1989) “A prosodic theory of epenthesis,” Natural Language and Linguistic
Theory, 7, pp. 217–​260, and Ito, Junko and Armin Mester (2015) “The per-
fect prosodic word in Danish,” Nordic Journal of Linguistics, 38, pp. 5–​36.
Ellen M. Kaisse is Professor Emerita of Linguistics at the University of
Washington. She studies Spanish, Greek, and Turkish phonology; phonological
interactions between words; and the phonology-morphology
connection. Her publications include Connected speech: The interaction of
syntax and phonology (1985, Orlando, FL & London: Academic Press) and
Harris, James and Kaisse, Ellen M. (1999) “Palatal vowels, glides and
obstruents in Argentinian Spanish,” Phonology, 16(2), pp. 117–190.
Wai-​Sum Lee is Associate Professor at the City University of Hong Kong. Her
research interest is the phonetics of the Chinese dialects. She is a member
of the editorial board of the Chinese Journal of Phonetics, a member of the
Council of the International Phonetic Association, and vice-​president of
the Phonetic Association of China.
Jie Liang is a professor in the School of Foreign Languages at Tongji
University. Her research interest lies in the production and perception
of speech sounds, especially with Chinese lexical tones. Her publications
include Liang, J. and van Heuven, V. (2004) “Evidence for separate tonal
and segmental tiers in the lexical specification of words,” Brain and
Language, 91, pp. 282–​293 and Liang, J. (2006) Experiments on the modular
nature of word and sentence phonology in Chinese Broca’s patients. LOT
PhD Dissertation 131, Utrecht.
Bijun Ling is a lecturer in the International School of Tongji University. Her
research interest lies in phonetics and phonology, especially with Chinese
lexical tones. She has published Ling B., and Liang J. (2017) “Focus
encoding and prosodic structure in Shanghai Chinese,” Journal of the
Acoustical Society of America, 141(6), pp. 610–​616.
Armin Mester is Professor of Linguistics at University of California, Santa
Cruz. His current research interests are principles of prosodic structure
(stress, accent, etc.); mapping of syntactic and morphological structures
onto prosodic form; and Optimality Theory. His publications include
Mester, Armin (1994) “The quantitative trochee in Latin,” Natural
Language and Linguistic Theory, 12, pp. 1–61 and Ito, Junko, and
Mester, Armin (2013) “Prosodic subcategories in Japanese,” Lingua, 124,
pp. 20–40.
Jörg Peters is Professor of German linguistics at Carl von Ossietzky University
Oldenburg (Germany). His research interests are the phonetics and phon-
ology of German, with a focus on prosody. His publications include
Intonation deutscher Regionalsprachen (2006, de Gruyter) and Intonation
(2014, Winter).
Rushen Shi is the director of the Language Research Group (
ca), Université du Québec à Montréal, Canada. She is interested in funda-
mental mechanisms underlying language acquisition (see her acquisition
model in her 2014 article in the journal Child Development Perspectives and
other publications of empirical findings on her lab website:
Irene Vogel received her PhD in Linguistics from Stanford University, and
she is currently Professor of Linguistics at the University of Delaware. Her
research addresses different aspects of prosodic phenomena and interfaces
between phonology and other components of grammar. From the theoret-
ical perspective, Dr. Vogel is continuing to develop the theory of Prosodic
phonology (Nespor and Vogel 1986, reprinted 2007), and from the experi-
mental perspective, she is heading the Prosodic Typologies Lab, which
is conducting a large-​scale cross-​linguistic investigation of the acoustic
properties of word level (stress, tone) and phrase level (focus) prosodic
Lian-​Hee Wee is Professor of Linguistics and Associate Dean of Arts at
the Hong Kong Baptist University. His research focuses on the phono-
logical properties of Chinese languages and Asian Englishes. His latest
publications include “Tone assignment in Hong Kong English,” in
(Language), Phonological Tone (2019, Cambridge University Press) and a
coedited volume, Cultural conflict in Hong Kong (2018, Palgrave).
Shuxiang You is a research assistant professor in the Department of Linguistics
and Modern Languages at the Chinese University of Hong Kong and
assistant editor of the International Journal of Chinese Linguistics. He
obtained his PhD in Chinese linguistics at the University of Wisconsin-​
Madison in 2017. His current research interests include phonology,
phonology-​syntax interface, Chinese dialectology, and teaching Chinese as
a second language.
Hang Zhang is Associate Professor of Chinese Language and Linguistics
at the George Washington University. Her research focuses on second
language phonology. She has published in major journals in the field of
second language acquisition such as Second Language Research, Chinese
as a Second Language, and International Journal of Applied Linguistics.
Her recent book Second Language Acquisition of Mandarin Chinese Tones:
Beyond First-Language Transfer was published by Brill in 2018.
Hongming Zhang is Professor and Head of the Chinese Language &
Linguistics Program at the University of Wisconsin-Madison. He is also
executive editor of International Journal of Chinese Linguistics, series
editor of Routledge Studies in Chinese Linguistics, and editor of the
volume Phonology and Poetic Prosody of The Encyclopedia of China (3rd
edition). His recently published books include Syntax-phonology
interface: Argumentation from tone sandhi in Chinese dialects and Tonal prosody
in Yongming style poems.

Introduction

Hongming Zhang

The 13 papers in Prosodic Studies: Challenges and Prospects are selected
from the contributions presented at the International Conference on
Prosodic Studies: Challenges and Prospects (ICPS), held in Tianjin, China,
13–​14 June 2015. ICPS was co-​organized by Tianjin Normal University,
Nankai University, Tianjin Foreign Studies University, the Editorial Office
of Contemporary Linguistics of the Chinese Academy of Social Sciences
(CASS), Key Lab of Phonetics and Speech Science of CASS, and University
of Wisconsin-​Madison. ICPS included four keynote speeches, fourteen invited
speeches and forty-​five regular session presentations. More than 200 scholars
from around the world attended this conference.
As the title of this volume suggests, the papers collected here not only
challenge current prosodic studies, including the limitations of current
theories, models, and research methods, but also indicate prospective
research trends in the field, with each contributing a unique and fresh
perspective. The chapters address many topical issues related to various
aspects of prosodic studies, covering the prosodic hierarchy, prosodic
patterns, the interface between prosody and syntax/morphology, the
experimental approach to prosody, and prosody in first and second language
acquisition.
Prosody is one of the core components of language and speech, indicating
information about syntax, turn-taking in conversation, types of utterances
such as questions or statements, and speakers’ attitudes and feelings.
Prosody plays an important role in human speech perception. A sequence of
words that is not accompanied by prosody is hard for listeners to perceive.
To a certain extent, communication would not be effective without prosody.
A substantial literature on prosody from diverse perspectives has emerged
in the past several decades. First, the prosodic hierarchy and the prosodic
units in human languages are the core, yet unsolved, issues in prosodic
phonology. The prosodic hierarchy and the prosodic units presented in Selkirk
(1984, 1986) and Nespor and Vogel (1986, 2007) have stimulated a large amount
of research on prosody in a variety of fields. However, they are
also challenged by a large number of counterexamples in various languages,
and thus are subject to revision and updating. We begin with three chapters
in Part I that discuss this topic. Irene Vogel discusses the challenge of how to
constrain prosodic structure in the absence of the Strict Layer Hypothesis
(SLH), focusing on the Composite Prosodic Model, which includes a distinct
constituent between the phonological word and the phonological phrase –​
the composite group. She thus offers a more nuanced model of the prosodic
hierarchy that recognizes three different sub-​parts according to the nature of
their interface with other grammatical components. San Duanmu examines
syllabification and stress in English, and shows that it is possible to compare
current analyses in a consistent way and determine which ones fare better.
Specifically, he shows that the Law of Initials, the Law of Finals, and Max
Onset can all be satisfied at the same time (yielding Revised Max Onset). He
also claims that syllabification and stress can be evaluated simultaneously,
rather than being sequentially ordered. The resulting analysis yields
consistently good foot structures, fewer violations of the Weight-Stress
Principle, and higher percentages of correct predictions of main stress. Shuxiang You
analyzes clitics and the clitic group in Fuzhou Chinese. A thorough study of
the relevant data in Fuzhou, from the perspectives of both morphosyntactic
functions and phonological behavior, reveals that clitics in this dialect share
some common morphosyntactic and phonological properties with clitics in
other languages. Although enclitics and proclitics in Fuzhou Chinese show
asymmetries in terms of their phonological behavior, the clitic group as a
whole has very peculiar phonological behavior as compared to lexical items
and phrases, which provides motivation and evidence for the establishment of
the clitic group domain in this dialect. Moreover, Fuzhou clitics may attach
to constituents higher than the prosodic word, which constitutes a great
challenge to the Strict Layer Hypothesis.
Part II focuses on prosodic patterns. With the rapid development of
information technology and computer science, many studies have adopted
experimental and computational approaches to prosody research. Scholars have
shown increasing interest in examining the acoustic parameters associated with
prosodic phenomena. The results of these studies are widely applied in
multimedia communication, including text-to-speech, speech recognition, and
speech synthesis. Four chapters contribute to Part II. Judith
Hanssen, Carlos Gussenhoven, and Jörg Peters examine additional data from
a project to see whether the finding of a geographical cline in the realization
of non-final nuclear falling contours can be replicated, and whether it is also
found for IP-final nuclear contours. They discuss the effect of Dialect on the
phonetic realization of contours, as opposed to effects of time pressure, focus,
or word boundary location. They report dialectal differences in segmental
duration as well as tonal timing, pitch excursion, pitch slope, and overall
pitch level. It is well known that, compared to non-​final falls, final falls may
be realized with longer segmental durations, earlier nuclear peaks, or steeper
or shorter falling excursions. Lian-Hee Wee proposes a Prosodic Essence
Conjecture (PEC), which implies a new perspective on language typology in
place of traditional notions of tone versus stress languages. A corollary of PEC
is that tone and accent are phonetically the same; thus, prosodic principles of
meter (such as minimum word requirements) would be universal. PEC rules
out prosodic contrasts where length or intensity is used without allowing
pitch. PEC does not supersede typology derived from prosodic marking at
different levels (syllable, word, phrase). Si Chen argues for statistical modeling
of phonetic data in providing a phonological representation of tones using
Chao’s letters or the L, M, and H representations. The chapter first
examines several phonetic cues that are then subjected to a perceptual
experiment. The perceptual study shows that the other cues found in the phonetic
examination do not contribute significantly to the discrimination of allotone
pairs after voiced versus voiceless onsets, and that F0 contours are sufficient
for discriminating those allotone pairs without onset consonants. The F0
contours are statistically modeled, and the underlying pitch targets,
statistically tested to be quadratic, correspond well to the fieldwork
records. The fitted values from the optimal model are calculated, and sample
quantiles are obtained. The final representations provide similar basic tonal
shapes, with some differences in the exact integers used for the onset, turning
point, and offset. This method provides a representation more consistent with
normalized phonetic F0 values, taking the perceptual aspect into consideration.
Bijun Ling and Jie Liang focus on the acoustic realization of focus
and lexical tones in Shanghai Chinese, a word-tone language, through
an investigation of F0 and durational adjustment of disyllabic words
in short sentences.
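The general idea behind fitting a quadratic F0 contour and converting fitted values to tone-letter integers can be illustrated with a minimal sketch. It is not the procedure used in the chapter itself. The function names, the five-level scale, the use of quintile cut points, and the toy contour are all assumptions made here for illustration only.

```python
from statistics import quantiles


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_quadratic(ts, ys):
    """Least-squares fit of y = a + b*t + c*t**2 via the normal equations."""
    s = [sum(t ** k for t in ts) for k in range(5)]                  # power sums of t
    r = [sum(y * t ** k for t, y in zip(ts, ys)) for k in range(3)]  # right-hand side
    m = [[s[i + j] for j in range(3)] for i in range(3)]
    d = det3(m)
    coeffs = []
    for col in range(3):  # Cramer's rule, one coefficient per column
        mc = [row[:] for row in m]
        for i in range(3):
            mc[i][col] = r[i]
        coeffs.append(det3(mc) / d)
    return tuple(coeffs)  # (a, b, c)


def to_level(value, cuts):
    """Map a value to a level 1..5 given four ascending cut points."""
    return 1 + sum(value > c for c in cuts)


def tone_levels(ts, ys, cuts):
    """Levels at the onset, the turning point (if inside the contour), and the offset."""
    a, b, c = fit_quadratic(ts, ys)
    points = [ts[0]]
    if c != 0:
        turn = -b / (2 * c)  # vertex of the fitted parabola
        if ts[0] < turn < ts[-1]:
            points.append(turn)
    points.append(ts[-1])
    return [to_level(a + b * t + c * t * t, cuts) for t in points]


# Hypothetical time-normalized contour (arbitrary normalized F0 units) that dips mid-way.
ts = [0.0, 0.25, 0.5, 0.75, 1.0]
ys = [2.0, 1.25, 1.0, 1.25, 2.0]
# Hypothetical pooled values across tones; their quintiles serve as cut points (an assumption).
pooled = [0.8, 1.0, 1.3, 1.5, 1.7, 1.9, 2.2, 2.4]
cuts = quantiles(pooled, n=5)
print(tone_levels(ts, ys, cuts))
```

The sketch reports integers only at the onset, turning point, and offset, which are the three positions the chapter summary mentions. How the cut points are pooled and how the optimal model is chosen are left open here.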
Three chapters in Part III are about the interface between morphosyntax
and prosody. It has been widely observed that phonological structure is
sensitive to morphosyntactic structure, but which elements of phonological
structure are influenced by morphosyntactic structure, and how, is
still open to debate (Kaisse 1985). Ellen Kaisse reports on an initial survey of
processes in the phonological literature described as applying across words,
and she speculates on why postlexical application is so strongly skewed toward
certain kinds of processes and not others. Junko Ito and Armin Mester show
that the recursion-based conception within Match Theory allows for a
conceptually and empirically cleaner understanding of the phonological facts and
generalizations in Japanese, as well as for an understanding of the respective
roles of syntax and phonology in determining prosodic constituent
structure and of the limited types of distinctions in prosodic
category that are made in phonological representation. Hongming Zhang
discusses some interface issues through case studies of Xiamen Chinese and
Pingyao Chinese. He argues that Optimality Theory (OT) fails to
capture the nature of tone sandhi in both Xiamen and Pingyao, whether by
brute force or by ad hoc constraints, and that the interface theory under the OT
framework does not have explanatory power superior to that of the theory
proposed before the OT era.
The three chapters in Part IV study prosody in language
acquisition. More and more studies on prosodic properties have been
conducted in the field of first and second language acquisition. There is an
emerging interest in the following questions: When and how do infants acquire
prosodic information? What is the difference between the prosody of native
and non-​native speech? How are prosodic characteristics of second language
speech related to the degree of foreign accent? Jun Gao and Rushen Shi pre-
sent their empirical findings on infants’ perception of lexical tones during the
first year of life. The findings shed light on the mechanisms of first language
acquisition, in which input-independent capacities and input-guided learning
both play a role. Wai-​Sum Lee analyzes F0 (pitch) development in Cantonese
pre-​adolescent children, male and female, aged 4–​12 years. Her main findings
include (i) a progressive F0 decrease as age increases, (ii) a large F0 drop at age
12 in male children, indicating the onset of adolescent voice change, (iii) no
significant F0 difference between female children at age 12 and female adults,
indicating the end of female adolescent voice change, and (iv) no apparent
gender distinction in voice until age 12. Hang Zhang investigates the errors
made by 60 English, Japanese, and Korean speakers learning Chinese when
producing the two contour lexical tones T2 (rising tone) and T4 (falling tone).
This study finds that T2 is produced at a greater rate of accuracy in word-​
initial positions, while T4 is produced at a greater rate of accuracy in word-​
final positions. This study also finds two intertonal effects shared across the
three groups of speakers: (a) the accuracy rate of T4 is always greater when it
is followed by low tones than when it is followed by other tones, and (b) the
accuracy rate of T2 is always greater when it is followed by tones with low
onsets than when it is followed by tones with high onsets. Findings suggest
that second language tones are constrained by the cross-​linguistically common
phonetic mechanism of anticipatory dissimilation.
To conclude, this volume, as a reflection of current prosodic studies, is
not only worth reading for scholars who are interested in prosody but also
for theoretical linguists, psycholinguists, and scholars investigating language
acquisition.
In planning this project, I had two criteria in mind: broad coverage and
balanced perspectives. It is gratifying to note that the finished chapters have
come together as planned. A good range of topics –​prosodic hierarchy,
prosodic patterns, interface between prosody and syntax/​morphology, and
the prosody in language acquisition –​is covered. Our chapters also reflect
a balanced participation by Western and Eastern scholars, as well as by
phoneticians and phonologists. The approaches employed, too, display a
balance between empirical analysis and theoretical inquiry. It is our hope that
this volume will draw more scholarly attention to prosodic studies in the
fields of both Chinese linguistics and Western linguistics.
Finally, I would like to express my deep gratitude to Tianjin Normal
University, Nankai University, Tianjin Foreign Studies University, the
Editorial Office of Contemporary Linguistics of the Chinese Academy of
Social Sciences (CASS), and Key Lab of Phonetics and Speech Science of
CASS for funding the international conference “Prosodic Studies: Challenges
and Prospects” in June 2015. I also wish to thank my co-​editor, Youyong
Qian, who generously gave his time to help with the editing of this volume,
competently handled all technical and clerical matters, and acted as liaison
with the press and individual contributors.

Kaisse, E. M. (1985) Connected speech: The interaction of syntax and phonology.
New York; San Diego: Academic Press.
Nespor, M., and Vogel, I. (1986) Prosodic phonology. Dordrecht: Foris.
Nespor, M., and Vogel, I. (2007) Prosodic phonology: With a new foreword.
Berlin: Mouton de Gruyter.
Selkirk, E. (1984) Phonology and syntax: The relation between sound and structure.
Cambridge, MA: MIT Press.
Selkirk, E. (1986) “On derived domains in sentence phonology”, Phonology Yearbook,
3, pp. 371–405.

Part I

Prosodic hierarchy

Life after the Strict Layer Hypothesis
Prosodic structure geometry1
Irene Vogel

1.1 Introduction
Although Pāṇini studied phonological phenomena that apply across different
types of junctures (i.e., word-​internal and word-​external sandhi phenomena)
over 2,000 years ago, it is only in the last few decades that we have seen a sub-
stantial rekindling of interest in juncture phenomena in modern linguistics.
For example, different types of junctures were directly encoded by different
boundary types in Sound Pattern of English (SPE)-​type phonological analyses
and implicitly encoded in the levels of lexical phonology. Most recently, pros-
odic phonology has provided a means of addressing the different domains of
application of phonological phenomena in terms of phonological or prosodic
constituents that are mapped from morphosyntactic structures, but which
might or might not be isomorphic to those structures.
While the details of the number and nature of the constituents vary to
some extent across analyses, a core principle in early models of prosodic hier-
archies (e.g., Nespor and Vogel 1986/​2007; henceforth N&V) was the so-​called
Strict Layer Hypothesis (SLH). The SLH served to significantly restrict the
geometry of prosodic hierarchies by requiring that a constituent of a par-
ticular level (Cn) dominate only constituents of the immediately lower level
(Cn-​1); however, it was soon realized that the SLH was too restrictive and thus
had some undesirable consequences. This chapter examines the implications
of the SLH and considers proposals to weaken it in order to overcome the
drawbacks. It will be demonstrated that the two main components of such
proposals, allowing levels to be skipped in the prosodic hierarchy and the
introduction of recursive constituents, while resolving some problems, also
introduce new complications. In fact, this is not surprising since weakening
strong limitations on any system, and the prosodic hierarchy is no exception,
will automatically increase the options within that system. The challenge then
becomes how to limit the newly available structures to avoid excessive and
otherwise undesirable options.
Three recent proposals for re-​constraining prosodic structure in the absence
of the SLH are assessed with regard to their adequacy in constraining pros-
odic structure geometry as well as their success in accounting for a range of
phonological phenomena. First, Match Theory (e.g., Selkirk 2011) and the
Adjunction Approach (e.g., Itô and Mester 2009a, b), both of which exclude
a constituent between the phonological word and phonological phrase, but
admit recursive constituents, are examined and shown to have a number of
drawbacks with respect to constraining the prosodic hierarchy as well as
accounting for certain types of phonological phenomena. An alternative pro-
posal, the Composite Prosody Model, is advanced and shown to overcome
fundamental problems in the other approaches.
Crucially, the Composite Prosody Model includes an explicitly defined
prosodic constituent between the phonological word and the phonological
phrase, the composite group (roughly similar to the previous clitic group
(CG)). It is demonstrated that it is specifically the inclusion of this constituent
that allows us to avoid a number of the drawbacks of the other prosodic
models, permitting the formulation of a small set of strong restrictions on the
general architecture of the prosodic hierarchy as well as providing straightfor-
ward analyses of a range of phonological phenomena in different languages.
The Composite Prosody Model, moreover, recognizes a three-​way distinc-
tion among sets of prosodic constituents within the prosodic hierarchy based
on the nature of their interface with other components of grammar (syntax,
morphology, or no interface), but it also provides the means of unifying the
different sets of constituents through the formulation of a small number of
principles that govern the overall geometry of the prosodic hierarchy.
Specifically, in Section 1.2, the role of the SLH in prosodic phonology is
reviewed, and its problems, as well as its contributions, are considered. Then
Section 1.3 discusses the main proposals for weakening the SLH, focusing on
skipping levels in the prosodic hierarchy and recursion. Since weakening the SLH
introduced a number of new drawbacks, recent approaches to addressing these
problems are considered in Sections 1.4 and 1.5. In the former, models without
a constituent between the phonological word and phonological phrase (Match
Theory, Adjunction Approach) are examined, and in the latter, the Composite
Prosody Model, with the intervening composite group, is examined. Section
1.6 synthesizes the different types of considerations addressed in the preceding
sections and addresses the question of whether the differences among the various
constituents mean that it is not feasible to maintain a single prosodic hierarchy.
It is argued that the Composite Prosody Model does, in fact, provide the means
of unifying the prosodic hierarchy, while also recognizing important differences
among the constituent levels. Finally, Section 1.7 offers general conclusions.

1.2 The prosodic hierarchy and the role of the Strict Layer Hypothesis

1.2.1 Prosodic constituents

Building on insights about the inadequacy of syntactic constituents as
the domain of application of Liaison in French (Selkirk 1972), subse-
quent research proposed a model of phonology consisting of a series of


Interface with other components of grammar:
    Phonological Utterance
    Intonational Phrase (ι)
    Phonological Phrase (φ)
    Clitic / Composite Group (CG)
    Phonological Word (ω)
No interface with other components of grammar:
    Foot (Σ)
    Syllable (σ)

Figure 1.1 Prosodic hierarchy

Source: Nespor and Vogel 1986/2007

phonological, or prosodic, constituents that are related to, but not necessarily
identical to, syntactic structures.3 The approach was extended to mismatches
with morphological structure, and a combined hierarchy of the different types
of prosodic constituents was developed. The hierarchy was then sometimes
further extended to include smaller phonological structures consisting of more
than a single segment. An early model that incorporates these various types
of components is that presented in Nespor and Vogel (1986, 2007), shown in
Figure 1.1.4
Developments in prosodic phonology have included some differences
in the constituents of the hierarchy as well as the principles by which the
constituents are constructed.5 The constituent most commonly excluded from
the prosodic hierarchy is the CG, for reasons discussed below. The phono-
logical utterance is also frequently absent, generally because investigations
tend not to focus on phenomena with such large domains; however, in Match
Theory, it has to some extent been supplanted by a recursive intonational
phrase (Selkirk 2011).6 In other models, constituents are excluded on a case-​
by-​case basis; for example, it has been proposed by Schiering et al. (2010)
that in Vietnamese there are no constituents between the syllable and the
phonological phrase.
In some analyses, we also find proposals for additional or slightly
different constituents such as accentual and intermediate phrases, or roughly
corresponding major and minor phrases (among others, Beckman and
Pierrehumbert 1986; Elordieta 1997, 2007; Itô and Mester 2007, 2009a, 2012;
Jun 1998, 2005a (for overview); Selkirk et al. 2003; Selkirk and Tateishi 1988;
Shinya et al. 2004; Venditti 2005). The prosodic stem has also been proposed
as a constituent in the hierarchy, most notably for Bantu (e.g., Downing 1999;
Jones 2011) and Salish languages (e.g., Czaykowska-​Higgins and Kinkade
1998). A number of so-​called recursive constituents have been introduced as
well. These are most commonly found at the phonological word level (among
many others, Anderson 2005; Booij 1996; Hall 1999; Itô and Mester 2003,
2007, 2009a, b; Peperkamp 1997; Selkirk 1996, 2011; Vigário 2003), although
there are also proposals for recursive phonological phrase and intonational
phrase constituents (among others, Gussenhoven 2004, 2005; Itô and Mester
2007, 2009a, 2012; Ladd 1986, 1996/​2008; Selkirk 2011; Truckenbrodt 1999).
Different approaches to constructing the constituents have also been
advanced. The original procedure, which has come to be referred to as the
relational approach, used various types of morphosyntactic information (e.g.,
XP structure, side and branchingness of complements, functional elements) in
the mapping algorithms for constructing prosodic constituent structures (e.g.,
Selkirk 1978, 1980a, 1986; N&V 1982, 1986/​2007). Subsequent methods of
constituent construction have made use of morphosyntactic interfaces as well
but have relied on a series of different types of principles, indicated by their
names, for example, Alignment Theory (e.g., Selkirk 1986; Selkirk and Tateishi
1991), Wrap Theory (e.g., Truckenbrodt 1999), the adjunction approach (e.g.,
Itô and Mester 2009a, b), and most recently, Match Theory (e.g., Selkirk 2011).
By contrast, there is another category of constituent construction model,
referred to here as Phenomenon-​Based, where prosodic constituents are not
created on the basis of mappings from other grammatical constructs, but
rather on the basis of specific phonological phenomena observed in a par-
ticular language. For example, the Tone and Break Indices (ToBI) Approach
establishes a series of constituents in a language in relation to observed pitch
patterns and boundary phenomena such as lengthening and pausing (e.g.,
Beckman and Ayers 1994; Beckman and Hirschberg 1994; Jun 2005a; Venditti
2005; see Jun 2005b, 2014 for overview and language studies). Additionally, in
the Distributional Typology approach, prosodic constituents are constructed
as needed, based on the application of a language’s phonological rules and/​or
other patterns (e.g., Bickel et al. 2009; Schiering et al. 2007, 2010).
In the earlier models of the prosodic hierarchy, the overall geometry
was substantially restricted by the SLH; however, the SLH was soon found
to be too restrictive. Thus, despite differences in the number of prosodic
constituents and the means by which they were constructed, most subsequent
developments of prosodic theory have shared a common challenge of how to
appropriately weaken the SLH.

1.2.2 The Strict Layer Hypothesis

Pros of the SLH

The main advantage of the SLH was that it imposed strong limitations on the
geometry of the prosodic hierarchy. Most notably, it only permitted a given
constituent to dominate constituents of the immediately lower level in the
hierarchy, a property referred to as “strict dominance” or “strict succession”.
That is, a constituent Cn could only dominate constituents of the type Cn-​1,
as shown in (1). It could not skip levels or introduce recursion, with Cn
dominating either the same type or a higher type of constituent, as shown
in (2).

(1) Structure permitted by SLH

[ [ ]Cn-1 [ ]Cn-1 ]Cn

(2) Structures excluded by SLH

a. Skipping Levels: [ … [ ]Cn-2 ]Cn
b. Recursion 1: [ … [ ]Cn ]Cn
c. Recursion 2: [ … [ ]Cn+1 ]Cn
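The permitted configuration in (1) and the excluded configurations in (2) can be stated as a simple check on trees. The following Python sketch is illustrative only: the numeric ranking in LEVELS, the Node class, and the function slh_violations are assumptions introduced here for exposition and are not part of any of the models discussed.

```python
from dataclasses import dataclass, field

# Hypothetical numeric ranks for the constituents (higher = larger domain).
LEVELS = {"σ": 0, "Σ": 1, "ω": 2, "CG": 3, "φ": 4, "ι": 5}


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def slh_violations(node):
    """Return (parent, child, kind) for each dominance relation that
    departs from strict layering, i.e. Cn dominating anything but Cn-1."""
    problems = []
    for child in node.children:
        gap = LEVELS[node.label] - LEVELS[child.label]
        if gap > 1:
            problems.append((node.label, child.label, "skipping levels"))
        elif gap == 0:
            problems.append((node.label, child.label, "recursion 1"))
        elif gap < 0:
            problems.append((node.label, child.label, "recursion 2"))
        # Check the child's own daughters as well.
        problems.extend(slh_violations(child))
    return problems


permitted = Node("CG", [Node("ω"), Node("ω")])      # cf. (1)
skipping = Node("CG", [Node("ω"), Node("σ")])       # cf. (2a)
recursion_1 = Node("ω", [Node("Σ"), Node("ω")])     # cf. (2b)
recursion_2 = Node("ω", [Node("Σ"), Node("CG")])    # cf. (2c)

for tree in (permitted, skipping, recursion_1, recursion_2):
    print(slh_violations(tree))
```

Under these assumptions, only the first tree yields an empty list; each of the others reports exactly one offending dominance relation.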

As a result, prosodic trees had relatively flat and simple structures
compared to those of syntax and morphology. Moreover, the limited tree
structures resulted in limitations on phonological rule formulations (Selkirk
1980a; N&V), as shown in (3).7

(3) Types of prosodic rules

a. Domain span
[… ___ …]Cn
b. Domain limit
[___ …]Cn   or   [… ___]Cn
c. Domain juncture
[[…]Cn [___ …]Cn]Cn+1   or   [[… ___]Cn […]Cn]Cn+1

Domain span rules apply throughout a string of category Cn, without regard
for any internal structure; however, it was understood that, due to the SLH, the
internal structure of Cn could only contain one or more Cn-​1 constituents, and
that each of these constituents would be similarly structured. Domain limit rules
require the presence of a left or right edge of a given constituent type. It was also
understood that the edge in question would coincide with the corresponding
edge of the next lower level, and any additionally lower levels. In domain junc-
ture rules, two constituent levels must be taken into consideration, and the SLH
ensured that the juncture was between two constituents of the same type, and
that these constituents were contained within the same larger constituent.
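The three rule types in (3) differ only in which positions of a constituent they refer to. A minimal sketch, assuming that a constituent Cn is represented as a non-empty list of segments and a Cn+1 as a list of such lists; these representations, the function names, and the use of orthographic letters in place of segments are all hypothetical conveniences.

```python
def span_positions(cn):
    """Domain span: every position inside a single Cn is a potential target."""
    return list(range(len(cn)))


def limit_positions(cn):
    """Domain limit: only the left edge and the right edge of a Cn."""
    return (0, len(cn) - 1)


def juncture_pairs(cn_plus_1):
    """Domain juncture: the segments flanking each boundary between two
    sister Cn constituents contained in the same Cn+1."""
    return [(left[-1], right[0])
            for left, right in zip(cn_plus_1, cn_plus_1[1:])]


# Three words treated as sister constituents inside one larger constituent.
phrase = [list("gray"), list("geese"), list("nest")]

print(span_positions(phrase[0]))   # [0, 1, 2, 3]
print(limit_positions(phrase[0]))  # (0, 3)
print(juncture_pairs(phrase))      # [('y', 'g'), ('e', 'n')]
```

The juncture function relies on the SLH guarantee described above: the two constituents it pairs are of the same type and are sisters within one larger constituent.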
As a consequence of determining what types of prosodic structures and
rules were possible, the SLH also made specific claims about what we would
not expect to find in languages. For example, it was predicted that we would not
find structures (and rules applying to structures) such as those in (4). To facili-
tate identification of the constituents, the phonological words are indicated
in bold and the phonological phrases are enclosed in braces (i.e., {}); where
multiple constituents of the same level are present, they are numbered sequen-
tially. The symbols ι, φ, ω, Σ, and σ represent, respectively, the intonational
phrase, phonological phrase, phonological word, foot, and syllable.8

(4) Impossible configuration according to SLH

* [{[…[ ]ω1]ω2 [ ]ω3}φ1 {[ ]ω4 [ ]Σ [ ]ω5 [ ]σ}φ2]ι

In (4), the intonational phrase (ι) dominates two phonological phrases, as
was allowed by the SLH; however, other aspects of the structure are prob-
lematic. First, the right edge of φ2 does not coincide with the right edge of
its internal ω, where there is, instead, an intervening syllable. Moreover,
while φ2 dominates two ωs in accordance with the SLH, it also skips levels,
additionally dominating the final syllable, and the foot between ω4 and ω5.
Furthermore, while φ1 dominates two ωs (ω2 and ω3), ω2 dominates ω1 in a
recursive structure. With regard to rule application, a ω juncture rule in the φ
domain would be able to apply between the adjacent ω2 and ω3 constituents;
however, it would not apply between ω4 and ω5 due to the intervening stray
foot. While the SLH avoided the “messiness” of structures such as (4), it also
introduced a number of problems.

Cons of the SLH

Assuming that the goal of (at least) generative linguistic theories is to account for
all, and only, the possible human languages, restrictions on linguistic models
are necessary in order to prevent overgeneration of structures and rule types.
The SLH, in restricting prosodic tree geometry, did exclude many potentially
undesirable and/​or incorrect possibilities; however, it also excluded structures
that are actually attested in languages.
One criticism that has been leveled against the SLH is that it led to the
excessive overlap of constituents, most commonly involving phonological
words and CGs, as illustrated in (5) (among others, Vogel 2009).

(5) Overlap of phonological word and clitic group constituents

[[large]ω]CG [[gray]ω]CG [[geese]ω]CG [[nest]ω]CG [[each]ω]CG [[night]ω]CG

In this case, not only do the ωs coincide with CGs, but they also happen
to coincide with feet, and the feet with syllables. While such overlapping
structures can be found in English, they do not constitute a substantial
presence in the language. By contrast, we observe more consistent overlap in
so-​called isolating languages such as Chinese and Vietnamese, where many
words are monomorphemic, and indeed monosyllabic. Moreover, function
words that might join into a CG, such as articles, are often lacking. In fact, as
mentioned above, it has been claimed that Vietnamese does not exhibit evi-
dence for any distinct phonological constituents between the syllable and the
phonological phrase (Schiering et al. 2010).
While some cases of constituent overlapping might be eliminated if we do
not require all constituent levels to be present in all languages, it is not clear
that constituent overlapping is so undesirable as to qualify as the basis for
fundamentally altering the content and principles of the prosodic hierarchy.
Furthermore, even if many, or most, constituents of certain types overlap in
a language, there may still be some structures where this is not the case (e.g.,
particles that do not count as ωs in Chinese). A more principled problem
exists, however, with claiming that there is no need for a particular con-
stituent in a language. Such negative claims cannot be proven, and as pointed
out by Vogel (2009: 22), they, in fact, introduce the “Black Swan” problem,
since we cannot know whether different or more subtle analyses may subse-
quently reveal evidence for the constituent in question, the heretofore unob-
served “black swan” appearing after numerous white swans. However, even
in the absence of overt manifestation of a given constituent in a language, if
a set of prosodic constituents is part of universal grammar, the constituent
in question must by definition be present as, for example, tense markers in
Chinese (among others, Vogel 2008a, b, 2009).9 Finally, it should be noted
that the degree and type of overlap, instead of being problematic, might in
fact serve as typologically interesting phonological properties that lead to
additional linguistic generalizations.
Independently of the issue of overlap and the universality of prosodic
constituents, the SLH introduced a more clearly damaging flaw into prosodic
tree structure. It required that certain elements be promoted to higher-​level
constituents simply in order to form sisters of other such elements and be
parsed at the next level of the hierarchy. This is illustrated with the Italian
structure in (6); the subscript “CL” indicates a clitic element.

(6) SLH: Promotion of Clitics to Phonological Words

[ [lo]ω [si]ω [serve]ω ]CG
lo si serve
itCL oneCL serves ‘one serves it’

In order for the two clitics, lo and si, to combine into a CG (or phono-
logical phrase in a tree lacking the CG), according to the SLH, they must be
ωs, like the verb serve. This is problematic, however, since the clitics do not
otherwise have the properties associated with ωs (e.g., they only contain a
single mora and fail to satisfy word minimality; consequently, they also do not
exhibit stress like other ωs). Thus, while promoting the clitics to ωs allows the
CG to consistently dominate constituents only one level lower in the prosodic
tree, doing so compromises the crucial characteristics of the ω itself (among
others, N&V 2007; Vogel 1999, 2009).
At first glance, it might seem possible to combine the two clitics in (6) into a
ω, presumably by first combining them into a foot. This would yield a structure
that meets word (and foot) minimality, consistent with Itô and Mester’s (2003)
Maximal Parsing constraint that groups two syllables into feet in Japanese
word clippings, and two monosyllabic function words into feet in German
(Itô and Mester 2009a, following Kabak and Schiering 2006). Such a struc-
ture, however, yields incorrect results in Italian. That is, if the sequence lo si
constitutes a ω (i.e., [lo si]ω), it would incorrectly be subject to the (Northern)
Italian Intervocalic s-​Voicing rule, which applies within the ω domain (e.g.,
N&V), as shown in (7).

(7) Intervocalic s-​Voicing (ISV)11

a. Intervocalic s-Voicing: s → [+voice] / [… V ___ V …]ω
b. ISV applies within a phonological word: [i[z]ola]ω ‘island’,
[famo[z]o]ω ‘famous’
c. ISV does not apply between clitics: *[lo [z]i]ω compra ‘(he) buys it
for himself’ (< lo si compra = it self buys)

Moreover, a ω would be expected to exhibit stress, and the vowel in the
stressed (open) syllable would be expected to undergo lengthening, as in
the word posi [pó:zi] ‘(you) place’. This does not occur, however, in the clitic
sequence lo si, where the correct form is [losi], not *[ló:zi].
In sum, while the overlapping of elements on two (or more) levels of the
prosodic hierarchy might be considered a problem associated with the SLH,
it does not constitute a clear argument against the SLH and, as mentioned, it
might in fact yield interesting typological insights. The promotion of elements
from lower to higher levels of constituency, however, does constitute a crucial
flaw since the promoted elements do not exhibit the requisite properties of
the higher constituents. In this case, it is no longer possible to uniquely and
unambiguously define the constituents, and the additional strings with the
same constituent labels may consequently result in incorrect predictions about
the application of phonological phenomena associated with these structures.

1.3 Weakening the SLH

To address the drawbacks of the SLH, proposals have been advanced to weaken
different components of the principle, rather than reject it completely, since
it does offer the important advantage of imposing restrictions on prosodic
structure geometry. The most widely adopted modification is the weakening
of “strict dominance” to allow levels to be skipped in the prosodic hierarchy.
The SLH is also often weakened further to permit recursion; however, this has
the opposite effect. Instead of allowing a greater distance between the level of
a constituent C and the constituents it dominates, recursion yields structures
in which there is no distance between the levels since constituent C dominates
other constituents C. While these two modifications are often found together,
it should be borne in mind that they are, in fact, independent of each other.12
In the following sections, the main motivations and contributions of these
two modifications are examined, and their adequacy is assessed.

1.3.1 Skipping levels

The model of the prosodic hierarchy in Figure 1.1 above includes a set of
interface constituents, beginning with the phonological word, as well as a set
of lower, non-​interface, constituents. While the SLH prohibited the skipping
of levels in the interface constituents, a similar limitation was not necessarily
imposed on the lower constituents, where precedents for skipping levels can
be found, and indeed are often taken for granted. For example, the parsing of
extrasyllabic segments and extrametrical syllables typically involves skipping
levels, as illustrated with the Italian words in (8); the relevant elements are

(8) Skipping levels in non-interface constituents

a. segment extrasyllabicity
[ s [ [fi]σ [da]σ ]Σ ]ω    sfida ‘challenge’
b. syllable extrametricality
[ [ca]σ [ [ser]σ [ma]σ ]Σ ]ω    caserma ‘barracks’

In (8a), /​s/​is excluded from the syllable onset with /​f/​in accordance with
the Sonority Sequencing Principle, and parsed directly into the ω, skipping
both the syllable and foot levels.13 In (8b), stress is on the penultimate syllable,
the head of its foot. The first (light) syllable cannot be included in the foot,
nor can it form a foot on its own, so it is parsed at the ω level, skipping the
foot level.

Given the precedents for skipping levels in the lower prosodic constituents,
weakening the SLH to permit the skipping of levels in the interface
constituents does not introduce a completely foreign option into phono-
logical structure, and it offers a solution to several problems mentioned in the
previous section. For example, if smaller constituents are no longer promoted
to larger constituents for which they lack the necessary properties, the struc-
ture in (6) above can be revised as in (9), where the syllables corresponding to
the clitics are parsed directly at a higher constituent level Cn (i.e., composite
group in the present model).

(9) Skipping levels: (6) revisited

[ [lo]σ [si]σ [serve]ω ]Cn    (ω = Cn-1)
lo si serve
itCL oneCL serves ‘one serves it’

The revised structure avoids creating subminimal ωs, as well as the incorrect
combination of the clitics into a ω, where they would be expected to undergo
ω level phonological phenomena (cf. (7) above). Crucially, the structure in
(9) also makes the correct prediction regarding the lack of phonological inter-
action between adjacent clitics, and between clitics and their host. That is,
neither the /​s/​of the clitic si nor that of the verb serve becomes [z]‌since their
intervocalic contexts do not fall within the ω. Note that ISV is also correctly
predicted not to apply with a clitic following its host (e.g., [guardando]ω si, not
*[guardando]ω [z]i ‘looking at oneself’).
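The contrast between the rejected parse, in which the clitics form a ω, and the level-skipping parse in (9) can be made concrete with a small sketch of ISV as a ω-span rule. Everything in the code is an assumption made for illustration: strings are orthographic stand-ins for phonological forms, a parse is a list of (label, string) pairs, and the names VOWELS, isv, and pronounce are hypothetical.

```python
VOWELS = set("aeiou")


def isv(omega):
    """Voice s to z when it stands between two vowels inside one ω."""
    out = []
    for i, seg in enumerate(omega):
        intervocalic = (
            0 < i < len(omega) - 1
            and omega[i - 1] in VOWELS
            and omega[i + 1] in VOWELS
        )
        out.append("z" if seg == "s" and intervocalic else seg)
    return "".join(out)


def pronounce(parse):
    """Apply ISV only to the items of the parse that are labelled ω;
    material parsed outside a ω (e.g. a bare σ) is left untouched."""
    return " ".join(isv(form) if label == "ω" else form
                    for label, form in parse)


# ISV inside a phonological word, as in (7b).
print(pronounce([("ω", "isola")]))                              # izola
# Rejected parse: the two clitics grouped into a ω, as in (7c).
print(pronounce([("ω", "losi"), ("ω", "compra")]))              # lozi compra
# Level-skipping parse, as in (9): clitics are bare syllables.
print(pronounce([("σ", "lo"), ("σ", "si"), ("ω", "serve")]))    # lo si serve
# Enclitic following its host.
print(pronounce([("ω", "guardando"), ("σ", "si")]))             # guardando si
```

Only the rejected parse produces the unattested voiced form, which is the prediction the level-skipping structure is meant to avoid.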
Finally, although it is not the focus here, it should be noted that skipping
levels has also been proposed for higher constituents of the prosodic hierarchy.
For example, in Selkirk’s (2011) analysis of the Bantu language, Xitsonga, an
intonational phrase may directly dominate a phonological word, skipping the
φ level, as illustrated in (10).14

(10) Xitsonga: φ level skipped (Selkirk 2011)

[ [ndzi-nyíka mu-nw!í]φ [tí-n-g u:vu]ω ]ι

‘I am giving the drinker clothes’

It is argued that the final phonological word is parsed directly into the
intonational phrase since it undergoes high tone spread from mu-nw!í; if
it constituted a φ, its left boundary would block the tone spreading. Since
Selkirk does not include the composite group in her prosodic hierarchy, one
level (φ) is skipped here.

1.3.2 Recursion
Removing the strict dominance requirement of the SLH not only permitted
prosodic levels to be skipped, but it also opened the door for recursion. If a
constituent is not required to dominate only constituents of the next lower
level, it could just as well dominate constituents of its same level, or even a
higher level. These two options were seen above in (2b) as Recursion 1 (i.e.,
[… [ ]Cn]Cn), and (2c) as Recursion 2 (i.e., [… [ ]Cn+1]Cn). While both types of
recursion are found in syntax, only Recursion 1 is typically proposed for
prosodic structure; thus, an additional principle may be needed to exclude
Recursion 2. Note that if both types of recursion are allowed in prosodic
structure, the SLH is effectively eliminated, not just weakened.
Although there is no single type of motivation provided for the introduc-
tion of recursion across prosodic levels, the main considerations involve the
avoidance of constituent proliferation and the expression of similarities between
certain types of strings. The potential parallelism between prosodic and (recur-
sive) morphosyntactic structures is also considered a motivation in some cases,
especially at the higher prosodic levels (φ and ι), as discussed further below.
Like skipping levels, recursion has precedents in the lower, non-​interface
prosodic constituents. For example, Recursion 1 has been proposed to account
for extrasyllabic consonants, so instead of the type of structure seen above in
(8), a consonant that is excluded from a syllable for violating the Sonority
Sequencing Principle (SSP) would be included in a recursive syllable (σ’), as in
(11). In this larger σ’, the SSP is no longer in effect (among others, McCarthy
1979; see discussion in Watson 2011).15

(11) Recursive syllable

a. lapse: [ [læp]σ s ]σ’      b. stab: [ s [tæb]σ ]σ’

Similarly, recursive feet, sometimes referred to as super feet or suprafeet
(among others Itô and Mester 1992; Selkirk 1980b, 1984), have been proposed
to parse a third syllable that falls outside a binary foot, avoiding the creation
of a ternary foot, as illustrated in (12a) vs. (12b).


(12) Recursive vs. ternary foot structure

a. Recursive Foot: [ [Ca na]Σ da ]Σ’      b. Ternary Foot: [Ca na da]Σ

While the lower foot structure in (12a) meets the requirement that feet be
binary branching, with maximally two syllables, the upper foot, or Σ’, fails
to meet this requirement. It might be argued that the Σ’ is binary branching,
dominating a Σ and a σ; however, the content of Σ and Σ’ is nonetheless dis-
tinct. Thus, the seemingly recursive syllable and foot structures, in fact, exhibit
different, rather than the expected similar, properties at the repeated constituent
levels.
Although the role of the mora in the prosodic hierarchy is not totally clear,
recursion has been proposed for this element as well. In this case, recursion
is usually introduced to parse non-​moraic segments with moraic ones, for
example, combining an onset consonant with the vowel in a CV syllable (e.g.,
[t [a]‌μ]μ’).16 Recursive moras have also been proposed for Arabic as a means
of distinguishing between segments that do and do not count (i.e., contribute
weight) for the purpose of stress assignment. As shown in (13), only recursive
moras (i.e., with a moraic presence at both the lower and upper levels) con-
tribute to syllable weight, so the structure in (13a) constitutes a heavy syllable
but the one in (13b) does not (e.g., Hayes 1995; Watson 2002, 2011).

(13) Recursive mora for Arabic stress

a. Heavy Syllable      b. Light Syllable
[Tree diagrams of moraic structure]


Differently from other recursive structures, both μ and μ’ correspond to the
same type of “weight” unit in Arabic. The μ’ cannot, however, be consistently
defined since in some cases a coda C constitutes a μ’, but in others it does not,
a determination that does not depend on the segment itself, but rather on its
position within a word (i.e., only a non-​final coda may be a μ’).
Turning now to the interface prosodic constituents, it can be seen that recur-
sion has an analogous effect to that observed in the lower constituents. That
is, when elements are parsed in a recursive constituent C’, this constituent
acquires phonological properties that are different from those of the core C,
precisely because C’ incorporates the “stray” material that had been excluded
from C in the first place.
For example, if a stray syllable corresponding to a (level 2) affix17 is parsed
along with a ω in a recursive ω’ (e.g., [σ [ ]ω]ω’), the shared ω label suggests that
both the outer and inner structures (ω’ and ω) identify the same type of con-
stituent; however, the two types of phonological word exhibit different prop-
erties. In fact, this is not surprising, since the affix in question was originally
excluded from the (core) ω precisely because it did not participate in ω level
phenomena. A constituent that does contain the affix would then by definition
exhibit properties that are distinct from those of the original ω. For example,
it was seen that the Italian Intervocalic s-​Voicing rule applies within the ω, but
it does not apply between a level 2 prefix and a root. Thus, if a recursive ω’ is
created that contains such a prefix, it must be distinguished from ω, since the ω’
continues to exhibit [s] (e.g., [[ri]σ [salare]ω]ω’ ‘(to) re-salt’ = [ri[s]alare], not *[ri[z]alare]).
Recursive structures are frequently introduced to accommodate not only
level 2 affixes but also the various other types of elements that are no longer
inappropriately promoted to higher constituent levels to satisfy the SLH, in
particular clitics and other function words. For example, structures like the
Italian clitic construction seen above in (9) are often analyzed with recursion,
as in (14) (e.g., Peperkamp 1997).

(14) Recursive phonological word with clitics

[lo si [serve]ω]ω’
itCL oneCL serves ‘one serves it’

Again, there is a problem if C and C’ are considered the same type of con-
stituent (ω) since they exhibit different phonological behaviors. As with level
2 prefixes, the ω-​domain ISV rule fails to apply in a ω’ with clitics, and both
instances of /​s/​remain voiceless (i.e., [lo si [serve]ω]ω’, not *[lo zi [zerve]ω]ω’), as
noted above.
Although they do not necessarily involve stray elements, compounds are
also often analyzed as recursive phonological words. The individual members
form ωs on their own, and when they are combined into a compound word,
the result is labeled ω’, as in (15) and (16).

(15) Recursive phonological word for compounds

[[police]ω [academy]ω]ω’

(16) Recursive phonological word for compounds –​multiple levels

a. [[[fish]ω [bowl]ω]ω’ [light]ω]ω’
b. [[[[fish]ω [bowl]ω]ω’ [light]ω]ω’ [factory]ω]ω’

If ω and ω’ are the same type of constituent, as implied by the repeated
ω denomination, it is expected that they will show the same phonological
behavior. In this case, too, the facts are otherwise. While word (ω) stress is
assigned in relation to a combination of phonological and morphological
properties, in compounds, the Compound Stress Rule regularly enhances
the first element regardless of its morphological or phonological proper-
ties.18 Thus, stress is assigned to different positions in políce and acádemy,
but it predictably falls on the first member of the compound políce acádemy.
Moreover, both cases are distinct from the phrasal stress pattern with prom-
inence on the rightmost element (e.g., (the) lócal acádemy). Since the com-
pound formation possibilities in a language like English are vast, if each
compound forms another ω’, as in (16), the relationship between ω and ω’
becomes even less clear. Even if the multiple intermediate ω’s are reduced to
a single type of ω constituent (e.g., Itô and Mester 2007, 2009a, 2013; Selkirk
2011), the problem remains that the phonological properties of compounds
do not coincide with those of the individual ωs that compose them.
As previously noted, even if we do not observe properly recursive structures
in ω and ω’, it may be the case that the larger constituents that interface
with syntax do exhibit recursive properties, as in Selkirk’s (2011) analysis of
Xitsonga, exemplified in (17).

(17) Xitsonga: recursive intonational phrase (Selkirk 2011)

[[va-xava ti-ho:m!u]ι va:-nhu]ι’

buying,PE3 PL cattle people
‘they are buying cattle, the people are’

In (17), we observe Penultimate Vowel Lengthening before the right edge of
both ι and ι’, suggesting that both C and C’ exhibit the same properties. The
significance of the potential difference in recursion at the lower and higher
prosodic levels is considered further below.
Finally, it should be noted that there is a fundamental structural difference
in the geometry of the C and C’ levels in recursive structures, regardless of
whether they exhibit similar or distinct phonological behaviors. That is, while
Proper Headedness is a requirement of the C level (i.e., constituent Cn must
dominate at least one Cn-​1), this is not the case for the corresponding C’ con-
stituent. C’ always dominates another C or Cs, which are crucially considered
to be the same type of constituent, so they do not meet the definition of the
head (= Cn-​1). If C’ then skips a level, directly dominating Cn-​2, this constituent
would also not meet the definition of the head. Thus, while Proper Headedness
may be a requisite of C, the same is not true for C’. In fact, in the structure in
(14) above, the ω’ dominates a ω and two syllables, but it does not dominate a
constituent of the next lower foot level. Note, however, that the ω does dom-
inate a (co-​extensive) foot, serve. Similarly, in (17), the ι’ dominates another ι
and a ω (possibly a ω’), but it does not dominate a φ, which would be its next
lower constituent. If a plain ω or ι in such cases may be taken to be the head
of its C’ constituent, this would suggest that it is not, in fact, the same type of
constituent as the C’ but a lower Cn-​1 constituent. If it is still argued that C is
the same type of constituent as C’, and that it may also serve as the head of
C’, Proper Headedness, as previously defined, can no longer be maintained as
a principle of prosodic structure in models that include recursive constituents.
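
The argument about Proper Headedness can be made concrete with a small sketch. It is only an illustration under stated assumptions: the tree encoding, the helper names (`node`, `base`, `properly_headed`), and the decision to treat ω’ as belonging to the same level as ω are all introduced here for exposition and are not part of any of the cited analyses.

```python
# Prosodic levels from lowest to highest, following the ordering in the text
# (syllable < foot < phonological word < phonological phrase < intonational phrase).
LEVELS = ["σ", "Σ", "ω", "φ", "ι"]


def node(label, *children):
    """Build a tree node as a (label, children) pair; a prime marks C’."""
    return (label, list(children))


def base(label):
    """Strip the prime, so that ω’ is treated as the same category as ω."""
    return label.rstrip("’")


def properly_headed(tree):
    """True if the node has at least one daughter of the next lower level."""
    label, children = tree
    n = LEVELS.index(base(label))
    if n == 0:
        return True  # σ is the lowest level in this sketch, nothing to check
    return any(base(child[0]) == LEVELS[n - 1] for child in children)


# (14): [lo si [serve]ω]ω’, with serve parsed as a foot co-extensive with ω
inner = node("ω", node("Σ", node("σ"), node("σ")))
outer = node("ω’", node("σ"), node("σ"), inner)

print(properly_headed(inner))  # the ω dominates a foot
print(properly_headed(outer))  # the ω’ dominates two σ and a ω, but no foot
```

Under these assumptions the inner ω passes the check and the outer ω’ fails it, which is the asymmetry between C and C’ described above.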

1.3.3 Assessing the modifications

A usual way to assess a linguistic analysis or theory is to find counterexamples,
showing that it cannot adequately account for all human languages. This is
not as simple as it seems, however, since language is human, and therefore,
bound to exhibit “imperfections”. We must thus also assess whether a par-
ticular type of counterexample or failure represents a crucial flaw in a theory,
or only an exceptional or idiosyncratic aspect of the data being examined.19 At
the same time, we must consider whether a clever solution advanced to make
an analysis “work” is actually making an insightful contribution to the model
or obscuring it by adding complications that are at best language specific.
Linguistic models that overgenerate the options are less likely to be challenged
by counterexamples; however, they are also inadequate in that they neglect the
requirement of a linguistic theory to account for only the set of natural human
languages. Thus, in the following assessments of the proposed modifications
of prosodic theory under consideration, in particular, skipping levels and
recursion, the focus is on their overall adequacy and their implications for the
geometry of the prosodic hierarchy in general, abstracting away from issues
related to language-​specific idiosyncrasies.

Assessment of skipping levels

The single change of removing strict dominance results in an enormous
increase in possible constituent structures, as was illustrated above in (4). The
simple structures in (18), without recursion or a constituent between the ω
and the φ, offer further insight into the magnitude of the increase.

(18) Phonological phrase configurations –​skipping levels

[Tree diagrams a. to d.: a φ dominating a ω plus one stray lower constituent or segment to its right]


All of these structures respect Proper Headedness, since the φ dominates
a ω; additionally, the φ dominates one of the lower, non-interface, prosodic
constituents, or a segment, the terminal node of a phonological structure
(e.g., –​s in Rob’[z]‌ car).20 In addition to the four options in (18), analogous
structures with the stray element on the left are also possible, as are structures
with any number of stray elements on the left and/​or right. If recursion is
permitted, the number of options increases exponentially, resulting in a vast
proliferation of possible prosodic structures, even if the number of prosodic
constituent types remains the same.
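
The growth in options can be illustrated with a rough count. The sketch below is not drawn from the literature: it assumes, purely for illustration, a φ with exactly one ω head, four possible types of stray element (foot, syllable, mora, segment), no recursion, and stray elements freely ordered to the left and right of the head. The names `STRAY_TYPES`, `count_configurations`, and `enumerate_configurations` are hypothetical.

```python
from itertools import product

# Assumed inventory of stray elements that a φ may dominate directly.
STRAY_TYPES = ["Σ", "σ", "μ", "segment"]


def count_configurations(k, n_types=len(STRAY_TYPES)):
    """Number of daughter sequences with one ω head and k stray elements."""
    # n_types choices for each stray element, and k + 1 positions for the head.
    return (k + 1) * n_types ** k


def enumerate_configurations(k):
    """List every daughter sequence explicitly, as a cross-check on the count."""
    configs = []
    for strays in product(STRAY_TYPES, repeat=k):
        for head_pos in range(k + 1):
            daughters = list(strays[:head_pos]) + ["ω"] + list(strays[head_pos:])
            configs.append(tuple(daughters))
    return configs


for k in range(4):
    print(k, count_configurations(k))
```

With one stray element the count is eight, corresponding to four right-hand configurations of the kind in (18) and their left-hand counterparts; with two and three stray elements it rises to 48 and 256, before recursion is even considered.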
As was noted above, in addition to strictly limiting constituent structures,
the SLH had the effect of limiting possible phonological rules, so skipping
levels also results in a considerable increase in the number of rule options.
While domain span rules are not affected, since they are not sensitive to
internal constituent structure, both domain limit and domain juncture rules
are affected. Moreover, the situation is further complicated if recursive con-
stituent structures are permitted.
Phonologists would most likely agree that complex configurations involving
the interspersing of multiple types of prosodic constituents are undesirable
since they allow for, and predict, the existence of languages that make use
of the various options in the application of phonological phenomena. In an
Optimality Theoretic approach, there could be constraints to militate against
such complex options; however, different constraints and rankings would still
be able to yield numerous undesirable configurations. Thus, while weakening
the SLH to permit skipping levels allows us to avoid the incorrect promotion
of elements to higher constituents, and to more adequately account for cer-
tain phonological phenomena, if left unchecked, this innovation massively
overgenerates phonological structures, and thus fails with regard to delimiting
the set of possible human languages.

Assessment of recursion

It is usual to account for the similarity in linguistic behavior of different types
of strings by analyzing them as the same type of constituent (C); strings that
show distinct types of behavior are analyzed as different types of constituents.
Thus, in a recursive structure in which a particular type of constituent C is
embedded within another C of the same type, it would be expected that both
Cs would exhibit the same behavior. As seen above, however, proposals have
been advanced in which the same constituent label is used for structures with
divergent properties.
Focusing on the prosodic constituents below the phonological phrase, we
have seen that there is a systematic distinction in the phonological properties
associated with the two varieties of C (C and C’) at all levels (i.e., interface
and non-​interface constituents). As noted above, such a distinction follows
automatically from any procedure that creates a C’ from the combination of a
C and material that was previously excluded from C. That is, if material must
be excluded from C, another constituent that includes this material must, by
definition, exhibit properties different from C. Thus, a phonological structure
[[ ]‌C]C’, where C’ has different properties from C, does not, in fact, meet the
definition of recursion.
For example, it was seen above in relation to the non-​interface constituents
that while the Sonority Sequencing Principle (SSP) is maintained within a σ,
the same is not true for a σ’, which includes extrasyllabic segments that do
not conform to the SSP (e.g., lapse [[læp]σ s]σ’ or stab [s [tæb]σ]σ’). Similarly,
it was seen that a recursive foot may permit three syllables, while the basic
foot may not have more than two. In these cases, the prime (’) designation
in C’ is essentially a diacritic that serves to distinguish the properties of this
structure from those of the basic C structure. Considering the C and C’ to
be the same type of constituent thus deviates from the standard practice of
using distinct constituent labels to identify strings with different properties,
and it obscures differences between phonological phenomena that crucially
distinguish between the strings delimited by C and C’ (Vogel 2009, 2012,
among others). Such drawbacks strongly indicate that an alternative analysis
is required.
With regard to the interface constituents below the phonological phrase,
the problem can be seen to stem from an effort to avoid the inclusion of a
prosodic constituent between the phonological word and the phonological
phrase. The recursive ω’ is used instead to collect a variety of elements that
must be excluded from the basic ω constituent since they exhibit different
properties. As previously noted, the problem is that any constituent that does
include such elements will inevitably exhibit properties that are distinct from
those of the ω. Such a situation will always arise in a language that makes
a distinction between “cohering”, or level 1 affixes, and “non-​cohering”, or
level 2 affixes (essentially SPE + and # boundary affixes). Since the latter are
excluded from the ω, any constituent that includes them must have properties
that are distinct from those of the ω.21
Since stress is a property often associated with the ω, it offers many oppor-
tunities to examine the relationship between the behaviors of the ω and ω’
constituents. For example, in English, level 1 affixes form a ω with the root
and participate in (word) stress assignment within that constituent (e.g.,
[grámmar]ω / [grammát-​ical]ω / [grammat-​icál-​ity]ω). By contrast, level 2 affixes
are excluded from the ω, and do not participate in stress assignment (e.g.,
[féver]ω / [[féver]ω ish]ω’ / [[[féver]ω ish]ω’ ly]ω’). Similarly, clitics do not partici-
pate in word-​level stress assignment (e.g., [[séver]ω it]ω’ / [[[séver]ω ing]ω’ it]ω’).
Furthermore, as noted above, the individual members of compounds have
stress assigned to their own ωs, while the whole compound undergoes the
Compound Stress Rule (e.g., [[féver]ω [blíster]ω]ω’). Considering the various ω
and ω’ structures to be the same type of constituent suggests that they should
have the same stress properties, which is clearly not the case.
Another well-​known type of stress pattern that also exhibits a difference
between the ω and ω’ is the “trisyllabic window” found in Italian and other
languages, according to which stress must appear on one of the last three
syllables of a ω.22 When clitics are added in a ω’, however, the same restriction
is not observed, as illustrated in (19).

(19) ω and ω’ with different properties: Italian stress

[[teléfona]ω me lo]ω’
telephone (to) meCL itCL ‘telephone it to me’

The antepenultimate stress in the verb form teléfona falls within the trisyl-
labic window, and when clitics are added, the stress remains on that syllable.
Thus, in the ω’ in (19), it appears on the fifth-​to-​last syllable.23
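
The window calculation can be spelled out as a short sketch. The syllabification te.lé.fo.na and the function names are assumptions made for illustration only.

```python
def stress_position_from_end(syllables, stressed_index):
    """Return 1 for final stress, 2 for penultimate, 3 for antepenultimate, ..."""
    return len(syllables) - stressed_index


def within_trisyllabic_window(syllables, stressed_index):
    """True if stress falls on one of the last three syllables."""
    return stress_position_from_end(syllables, stressed_index) <= 3


verb = ["te", "lé", "fo", "na"]     # the ω: stress on the second syllable
with_clitics = verb + ["me", "lo"]  # the ω’: the same ω plus two clitics

print(stress_position_from_end(verb, 1), within_trisyllabic_window(verb, 1))
print(stress_position_from_end(with_clitics, 1),
      within_trisyllabic_window(with_clitics, 1))
```

The verb alone has stress three syllables from the end, inside the window, whereas the same stressed syllable is five from the end once the clitics are counted, so the window holds of the ω but not of the ω’.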
With regard to segmental phenomena, it was seen above that the Italian
ω domain rule of Intervocalic s-​Voicing does not apply with level 2 prefixes,
clitics, or compounds, all of which would be parsed as ω’. Thus the ω, which
exhibits ISV (e.g., [i[z]ola]ω ‘island’, [noi-o[z]-in-o]ω ‘somewhat boring’
(< bore-​adj-​dim-​m,sg)),24 is distinct from the various ω’ structures, which
do not exhibit ISV (e.g., [lo [s]i ri-​[[s]ala]ω]ω’ ‘one resalts it’ (< itCL oneCL
re-​salts), [[dicendo]ω [s]e lo]ω’ ‘saying it to oneself ’ (< saying selfCL itCL),
[[porta]ω [[s]apone]ω]ω’ ‘soap dish’ (< carry soap)). Languages with vowel
harmony (VH) also consistently exhibit discrepancies between the phon-
ology of the ω and ω’. While VH typically applies throughout a ω, it does
not usually apply throughout a ω’ consisting of a compound word with multiple
ωs (e.g., Hungarian: [[olvasó]ω1 [terem]ω2]ω’ ‘reading room’ (ω1 = +Back;
ω2 = -Back)).
Thus far, the differences between the ω and ω’ constituents have involved
the application of rules within the smaller ω domain but not the larger ω’;
however, there are also cases in which rules apply within the ω’ but not the ω.
For example, in English, the well-​known Voicing Assimilation rule applies in
the ω’ domain, as seen with the addition of a (level 2) plural or third person
singular –​s, which is voiced following a voiced (non-​strident) segment (e.g.,
[nz]: (the) [[fan]ω-s]ω’, (he) [[fan]ω-s]ω’), but voiceless after a voiceless segment
(e.g., [ts]: (the) [[bat]ω-​s]ω’, (he) [[bat]ω-​s]ω’). As will be discussed below, identical
assimilation patterns are observed with the possessive, copula and auxiliary –​
s, which are also included in the ω’. By contrast, there is no requirement of
voicing assimilation within the ω, where a voiced segment may be followed by
either a voiced [z]‌or a voiceless [s] (e.g., [nz]: [lens]ω; [ns]: [dance]ω).
In sum, various types of phenomena show that C and C’ exhibit distinct
phonological properties, something that would not be predicted if they are
instantiations of the same type of constituent in a recursive structure. As
was seen, the differences are observed in both the non-​interface (syllable,
foot) and interface (phonological word) constituents, although as previously
noted, and discussed further in Section 1.6, the situation at higher levels may
be different.

The proliferation concern

As mentioned, a motivation for repeating constituent labels, even at the
cost of using the same label for strings that have different properties, resides
at least in part in a recurrent concern in phonological theory that can be
referred to as the “proliferation concern”. In relation to the prosodic hier-
archy, the specific concern is the potential creation of any number of pros-
odic constituents corresponding to whatever strings seem to constitute the
domains of phonological phenomena in a given language. In fact, early in
the development of prosodic phonology, Kanerva (1990: 161) explicitly
raised the question, “Will prosodic phonology fall victim to a cancerous pro-
liferation of prosodic levels?”
The concern about the proliferation of prosodic constituents or levels
reflects analogous concerns regarding earlier models of phonology where,
indeed, large numbers of (SPE) boundary types, and later, lexical levels,
were proposed, precisely to account for a broad range of phonological
patterns observed in specific languages. For example, 4 and 11 boundary
types were proposed for Dakota (Shaw 1980 and Carter 1974, respectively),
5 for Danish (Basbøl 1975), and 13 for Italian (Bertinetto 1999; Loporcaro
1999). Subsequently, 4 lexical levels were proposed for Dakota (Kaisse and
Shaw 1985; Shaw 1985), and then 11 were proposed for the same language
by Patterson (1990). The crux of the problem in these cases is the circular
approach to defining the domains of application of language-​specific phe-
nomena. That is, for each phenomenon (or set of phenomena), the context in
which it is found to apply is identified. This, in turn, is deemed a phonological
domain or constituent, and characterized in terms of boundaries, levels, or
other information. The domains thus established are then referred to in for-
mulating the conditions or contexts in which the phonological phenomena in
question apply. It may also be noted that, although the same degree of proliferation
has not been seen in analyses invoking co-phonologies (e.g., Anttila
2002; Inkelas 2014; Inkelas and Orgun 1998; Inkelas and Zoll 2005), an analo-
gous problem could arise if any number of co-​phonologies may potentially
be established in a language on the basis of its specific phonological patterns.

In prosodic phonology, the number of proposed constituents has not
reached “cancerous” proportions as feared. In fact, in any Interface-​Based
prosodic model, where constituents are constructed via general mapping
procedures between morphosyntactic constituents and phonological struc-
ture, the number of constituents is automatically restricted. Thus, the rela-
tional approach to the prosodic hierarchy in N&V comprised five interface
constituents. As noted above, an additional prosodic stem constituent has
been proposed in some cases, but this too is based on a specific, morphologic-
ally identifiable element. One or two tone-​related constituents (e.g., accen-
tual, major, minor phrases) have also been proposed in some analyses, but
these tend to coincide roughly with other established constituents (e.g., Itô
and Mester 2012; Selkirk and Koichi 1988; Shinya et al. 2004). In Match
Theory (Selkirk 2011), we find six domains, presented as three pairs of recur-
sive constituents (i.e., the inner and outer variants of the basic phonological
word, phonological phrase, and intonational phrase domains), all of which
are established in relation to specific morphosyntactic structures.
As mentioned earlier, of the five prosodic constituents in N&V, the CG
has often been viewed with suspicion and removed from the prosodic hier-
archy. Since the CG, and its subsequent development as the composite group
(e.g., Vogel 2009), was constructed in relation to specific morphosyntactic
elements (e.g., Hayes 1989; N&V), it did not, in fact, pose a risk of initi-
ating a slippery slope toward the unchecked proliferation of constituents.
Moreover, removing the CG does not remove the fact that there are clitics,
and other stray elements, that must be accommodated in some way in the
prosodic hierarchy. In fact, it is precisely such elements that are typically
parsed in the ω’.
Even in the Phenomenon-Based prosodic models that construct
constituents specifically to accommodate the phonological phenomena of
a given language, relatively few additional types of constituents have been
introduced. For example, in the ToBI Approach (see Section 1.2.1.), although
proliferation is not excluded on principle, only a small number of different
constituents have been proposed. Where we do, however, see a proliferation of
constituents is in the Distributional Typology approach developed by Bickel
and colleagues (see Section 1.2.1). Here too, though, it is not the constituent
categories per se that have proliferated, but the number of recursive levels of
a single constituent, in particular, the phonological word, as in (20).

(20) Multiple recursive phonological words in the Distributional Typology approach

[[[[[…]ω …]ω’ …]ω’ …]ω’ …]ω’

Although at first glance such a structure may not appear to result in pros-
odic constituent proliferation, closer examination reveals exactly the same
problem that arose with multiple boundary types and lexical levels. That is,
each of the phonological word levels is constructed to account for a different
phonological behavior, and thus there is no principled limit on the number and
nature of such constituents. In fact, Schiering et al. (2007) proposed 30 levels
of recursive ω’ for Dege Tibetan, and 14 levels for Limbu. In a more recent
analysis, however, the number of ω’ constituents in Limbu has been reduced
to two, indicating that even in this approach, an attempt is being made to
restrict the structure of the prosodic hierarchy (Schiering et al. 2007, 2010).
While the concern about the proliferation of prosodic constituent types
has for the most part been unsubstantiated, it was seen above
that the weakening of the SLH creates the potential for a different type of
proliferation. That is, in the absence of new restrictions, it is possible for pros-
odic configurations and their related rules to proliferate due to unrestricted
combinations of smaller (or same-​level) constituents. Although it is often
overlooked, this type of proliferation can be as detrimental as the potentially
unrestricted creation of constituents.

1.4 Restricting prosodic structure again –​ I

Weakening the overly restrictive SLH clearly yields some positive results; how-
ever, it also introduces serious problems, unless some other restrictions are put
into place. In this section, we examine approaches to re-​restrict prosodic struc-
ture that rely on mapping procedures involving a closer association between
morphosyntactic and prosodic constituents, and on constituent adjunction.
The alternative Composite Prosody Model will be discussed in Section 1.5.

1.4.1 Proliferation of constituents and recursion

It is often argued that recursive prosodic structures offer the advantage of
preventing constituent proliferation. As noted, however, any model of pros-
odic phonology that makes use of specific mapping procedures from morpho-
syntactic to prosodic structures will, by definition, avoid the unconstrained
proliferation of prosodic constituents. Thus, alignment approaches (among
others, Selkirk 1986, 1995), Wrap Theory (among others, Truckenbrodt 1999),
Match Theory (Selkirk 2011), and the Adjunction Approach (among others,
Itô and Mester 2009a, b), as well as the relational approach (among others,
N&V), all systematically exclude the possibility of idiosyncratic constituent proliferation.
If there is any savings with regard to proliferation offered by recursive
structures, it would appear to reside in the number of basic labels used for the
prosodic constituents, not the number of distinct domains. The fact remains
that if strings labeled as two types of C (C and C’) exhibit different phono-
logical properties, as discussed above, they in effect delimit different pros-
odic domains. Thus, in Match Theory (Selkirk 2011), there are three types
of constituent labels, mapped from three types of syntactic structures, as in
(21);25 however, each constituent type has two levels, as in (22), thus effectively
establishing six prosodic domains.

(21) Match morpho-​syntactic to prosodic structures (Selkirk 2011)

a. Match Clause → Intonational Phrase (ι)
b. Match Phrase → Phonological Phrase (φ)
c. Match Word → Phonological Word (ω)

(22) Max and min levels in recursive prosodic structures (Selkirk 2011)
a. Cmax = C level constituent not dominated by another C
b. Cmin = C level constituent not dominating another C
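
The two definitions in (22) can be applied mechanically to a recursive structure, as in the following sketch. The dictionary-based tree encoding and the names `node`, `dominates`, and `classify` are assumptions introduced here for illustration; they are not part of Match Theory.

```python
def node(label, *children):
    """A constituent with a category label and daughters (nodes or strings)."""
    return {"label": label, "children": list(children)}


def dominates(tree, category):
    """True if a node of `category` occurs anywhere properly inside `tree`."""
    for child in tree["children"]:
        if isinstance(child, dict):
            if child["label"] == category or dominates(child, category):
                return True
    return False


def classify(tree, category, dominated=False, out=None):
    """Collect max/min tags for every node of `category`, outermost first."""
    if out is None:
        out = []
    is_c = tree["label"] == category
    if is_c:
        tags = set()
        if not dominated:
            tags.add("max")  # (22a): not dominated by another C
        if not dominates(tree, category):
            tags.add("min")  # (22b): not dominating another C
        out.append(tags)
    for child in tree["children"]:
        if isinstance(child, dict):
            classify(child, category, dominated or is_c, out)
    return out


# Four ι constituents, each embedded in the next: ((((x)ι x)ι x)ι x)ι
nested = node("ι", node("ι", node("ι", node("ι", "x"), "x"), "x"), "x")
print(classify(nested, "ι"))
```

On this structure only the outermost ι is tagged max and only the innermost is tagged min, while the two intermediate ι constituents receive neither tag. A single ι with no embedding would receive both.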

In fact, the resulting six structures roughly line up with the domains iden-
tified in N&V, as shown in (23), where the composite group replaces the clitic
group. Indeed, Match Theory offers an option not present in Nespor and

(23) Comparison of prosodic constituent hierarchies

a. Match (Selkirk 2011) ιmax > ιmin > φmax > φmin > ωmax > ωmin
b. Relational (N&V 1986)26 ʊ >   ι     >    φ   > κ   > ω

Although recursion has only a negligible effect on the overall number of
prosodic levels, it offers the possibility of a more direct relationship between
phonological and syntactic structures, especially as applied in Match Theory.
It is unclear, however, to what extent this relationship actually holds, since
the potentially infinite recursion and depth of tree structures in syntax is not,
in fact, mirrored in the relatively flat structures in phonology. Indeed, Match
Theory includes a procedure that introduces additional phonological con-
stituent brackets within the original constituents that result in considerably
flatter structures, as in (24).

(24) Embedded intonational phrases and restructuring

[[[[x]Clause x]Clause x]Clause x]Clause → ((((x)ι min x)ι x)ι x)ι max → ((x)ι min (x)ι (x)ι (x)ι)ι max

As can be seen, the adjusted phonological structure in (24) is flatter than the
originally mapped structure, but now the resulting phonological constituents
no longer exhibit the recursion of the original syntactic structure. Moreover,
it is not clear what the additional constituents represent. If, as is argued, only
the uppermost and lowest levels of a constituent type need to be identified
(i.e., Cmax and Cmin), any intermediate levels would then be without a pros-
odic status, for example, the three (x)ι constituents in (24). If, instead, such
constituents are relabeled as (x)ιmin, the definition of this type of constituent is
not consistent across the different instances, and again, it is not clear how the
phonological structure is recursive in parallel to the corresponding recursive
syntactic structure.

A case such as that in (24) was, in fact, seen above in the Xitsonga example
in (17), where Penultimate Vowel Lengthening applies at the right edge of any
intonational phrase, regardless of its depth of embedding. At this point, we
must thus ask whether a C /​C’ distinction is actually necessary here, and pos-
sibly at the φ level, contrary to the suggestion that recursive structures may
be required at the upper levels of the prosodic hierarchy, even if they are not
present below the φ level.
It must be noted that despite attempts to restrict the options in the pros-
odic hierarchy, other recent proposals argue in favor of more options. For
example, it has been suggested that a “plain C” between Cmax and Cmin, as in the
representation between the arrows in (24), is in fact a different type of phono-
logical constituent, C[-​max, -​min] (e.g., Elfner 2012; Itô and Mester 2007, 2013).
Yet another constituent option was previously proposed, C[+max, +min], for cases
in which the maximal and minimal Cs are crucially coextensive (Haider 1993),
although Itô and Mester (2013) argue that this may be a trivial option. If each
possible combination of [max] and [min] introduces a distinct type of pros-
odic constituent, instead of three levels (ω, φ and ι), we would now have twelve.
While this still does not constitute a proliferation of constituents, and it is not
further expandable assuming only the three basic levels corresponding to syn-
tactic structures, it predicts the possibility of innumerable prosodic structures
across languages, where different types of constituents may be interspersed,
and with these, innumerable types of prosodic rules (especially edge and juncture
rules), as discussed above.
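
The figure of twelve follows from crossing the three basic categories with the four possible [max]/[min] specifications. The enumeration below is a minimal sketch of that arithmetic; the string notation for the resulting types is an assumption made for display purposes.

```python
from itertools import product

BASIC = ["ω", "φ", "ι"]
# Every combination of [±max] and [±min].
FEATURES = [(mx, mn) for mx in "+-" for mn in "+-"]

types = [f"{c}[{mx}max, {mn}min]" for c, (mx, mn) in product(BASIC, FEATURES)]

for t in types:
    print(t)
print(len(types))
```

Each basic category contributes four variants, so three categories yield twelve distinct constituent types if every feature combination is treated as its own type.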
In sum, while the use of only three basic constituent labels is claimed to
remove “the need, and thus the motivation, for further distinctions among
basic categories” (Itô and Mester 2013: 23), it appears that the economy, at best,
pertains only to the mapping rules and basic constituent labels. Moreover, the
concomitant introduction of recursion and a distinction between levels of recur-
sive constituents quickly results in an enormous, and implausible, overgeneration
of possible prosodic structures and rules, whereas the goal of linguistic theory is
to identify and address all and only those structures found in human language.

1.4.2 Parsing stray elements

As was seen above, the weakening of the SLH results in the appearance of
stray elements throughout the prosodic hierarchy (i.e., when elements such as
level 2 affixes, clitics, and other function words are no longer promoted to ωs),
yielding innumerable types of prosodic structures that include the syllables
and feet (and segments) corresponding to such elements parsed in different
orders and in different constituents. In fact, limitations are present in Match
Theory, since all lexical words (with any affixes) are parsed as ωs, that is, ter-
minal nodes of the syntactic tree; the issue of stray level 2 affixes does not
arise. There may, however, be a limited potential for stray affixes to appear in
ω’s with compounds, as in blueberries, where the plural pertains to the whole
compound. Since compounds are parsed as ω’, we could assume that their
plural markers are also parsed at this level (i.e., [[blue]ω [berry]ω s]ω’). By con-
trast, in the phrase blue berries, the plural pertains only to the noun berry, so
the structure just requires two ωs, which coincide with the highest ω’ for each
word (i.e., [[blue]ω/ω’ [berries]ω/ω’]φ).
At the higher levels of the prosodic hierarchy, the structures are not similarly
restricted, since clitics and function words may be parsed in any constituent
(i.e., φ/φ’ or ι/ι’), typically in parallel with their syntactic structure.
“Directional clitics” (DCLs) are the exception, since they must always attach
to a host either on the right or the left regardless of the syntax, as in the
leftward attachment of the auxiliary and copula –​s in English (e.g., Klavans
1982, 1985; Zwicky 1984; N&V). It was noted above (cf. (18)) that the lack
of a prosodic restriction on the parsing of stray elements at the higher levels
predicts the possibility of numerous structures, as illustrated in (25), with
one and two stray syllables to the right of the head constituent; however, any
number and sequence of syllables and/​or other elements could be included,
also to the left of the head.

(25) Match Theory: some options for parsing syllables

a. [[ ]‌ω’ σ]φ b. [[ ]‌ω’ σ σ]φ c. [[ ]‌ω’ σ]φ’ d. [[ ]‌ω’ σ σ]φ’
e. [[ ]‌φ σ]φ’ f. [[ ]‌φ σ]ι g. [[ ]‌φ’ σ]ι’ h. [[ ]‌ι σ]ι’

While all such configurations would be allowable, the question is whether
they are all phonologically meaningful. That is, we must ask, for example,
whether there are phonological differences between a syllable parsed within
a φ, φ’, ι, and so on, or as a sister of a ω’, φ, φ’, and so on, within a given
constituent C. Since different structural properties provide the opportun-
ities for different phonological behaviors, the prediction is that any and all
of the possible configurations could exhibit different phonological patterns.
Although many options might be avoided in a given language by adopting a
particular constraint ranking, a different ranking could nevertheless predict
their occurrence in some other language. The underlying problem is thus not
addressed –​the fact that a model that permits all of the structures in question
results in extreme overgeneration of possible grammars.
In contrast with Match Theory, Itô and Mester’s (e.g., 2009a, b)
Adjunction Approach substantially restricts the possible prosodic constituent
configurations, specifically with regard to the parsing of stray elements. That
is, while Itô and Mester assume the syntactic mapping of Match Theory for
the φ and ι constituents, their adjunction, as opposed to matching, process for
phonological word construction, parses all stray elements by adjoining them
individually as sisters of a ω (or ω’), forming recursive ω’s. Although this pro-
cedure may result in many recursive ω’s, Itô and Mester (2013) subsequently
reduce the number of levels to ωmax (= ω’) and ωmin (= ω), analogously to the
reduction of higher-​level recursion in Match Theory, illustrated above in (24).
For example, although multiple function words (FWs) could be parsed in recur-
sive ω’s, as in (26), a restructuring process would result in a flatter final structure.

(26) Adjunction Approach (Itô and Mester 2009a, b)

FW FW → [FW FW [ ]ω]ω’

As in Match Theory, however, there is no particular status associated with
the recursive ω’ constituents that are reduced from C’ when they are not the
Cmax. The concerns raised above in this regard apply here as well.
Nevertheless, the Adjunction Approach does impose considerably more
restrictions on prosodic structure than Match Theory. By restricting the
occurrence of stray elements to below the φ, the Adjunction Approach not
only removes the various options for these to be interspersed throughout
the higher levels of the prosodic hierarchy but also limits the number of
levels that may be skipped. That is, if ω’ directly parses a single segment,
the maximum extent of level skipping, this only allows the σ, Σ, and ω to
be skipped. By contrast, in Match Theory, it would also be possible to skip
the ω’, φ, φ’, ι, and ι’. In addition, since the Adjunction Approach parses all
function words, and level 2 affixes, including more “substantial” ones that
are promoted to ω (Itô and Mester 2009a), in the same way, as daughters of
ω’, it makes the prediction that the various stray elements in a language will
exhibit the same phonological behavior, depending only on their structure
(e.g., σ, Σ, ω).
In sum, the more stringent restrictions on the geometry of the prosodic
hierarchy imposed by the Adjunction Approach have the distinct advantage
of making strong, and testable, claims about the structure of possible phono-
logical systems. Indeed, it was seen that the various –​s morphemes in English
(i.e., N and V inflections, possessive, copula, and auxiliary) assimilate (or are
pronounced as [-​əz]) in the same way regardless of their position in the syn-
tactic tree. This similarity is captured by parsing them all in the ω’, whereas
parsing them in different prosodic constituents according to their syntax
positions makes the incorrect prediction that they would behave differently,
and misses the observed generalization. Nevertheless, it was seen that both the
Adjunction Approach and Match Theory encounter fundamental problems
with regard to the nature of their recursive structures, which at least for the
ω’, exhibit considerable differences between the C and C’ levels. Moreover,
there is a lack of clarity regarding the status of any constituents that may arise
between Cmax and Cmin, since it is argued that only the highest and lowest levels
are to be retained in recursive structures with multiple C’ constituents. Finally,
the proposed recursive constituents do not consistently exhibit a head, where
the head of C is defined as Cn-1 in the prosodic hierarchy. That is, where levels
are skipped, it is possible to arrive at structures in which C’ dominates another
C, and possibly other elements such as Cn-​2, but not Cn-​1.

1.5 Restricting prosodic structure again: The role of the Composite Group

As just seen, previous proposals for “re-​restricting” the geometry of the pros-
odic hierarchy in conjunction with the weakening of the SLH present other
types of problems involving recursion and the overgeneration of prosodic
configurations with stray elements. It is demonstrated here that both types
of problems are resolved by including a distinct prosodic constituent in the
prosodic hierarchy between the ω and the φ, the composite group. The recog-
nition of this constituent, moreover, allows us to formulate a small number of
principles that restrict the architecture of the prosodic hierarchy at both the
interface and non-​interface levels.

1.5.1 A distinct constituent between the phonological word and phonological phrase: The Composite Group

The constellation of structures that are problematic for the previous proposals
for the most part involve clitics and other stray elements, those that were
comprised in the original CG in N&V. The persistence of these challenges is
noteworthy, and suggests that despite its drawbacks, stemming largely from
the SLH, the CG addressed a fundamental need in prosodic phonology. As
demonstrated here, the composite group (κ), as part of the Composite Prosody
Model, maintains the initial insights offered by an intermediate constituent
between the ω and the φ, while also addressing the undesirable aspects of the
CG (cf. also Vogel 1999, 2008a, b, 2009, 2012, N&V 2007). For the present
purposes, the composite group is informally defined as consisting of a ω, or in
the case of compounds, multiple ωs, and any stray elements (i.e., level 2 affixes,
clitics, other function words); details of the definition of the κ, and more gen-
eral properties of the prosodic hierarchy, are discussed in subsequent sections.
Italian Intervocalic s-​ Voicing illustrates the crucial difference between
the ω and the κ as phonological domains, applying between vowels within
the former (morpheme-​internally and across a root and all suffixes), but
not the latter, as in (27) and (28), respectively; the second set of brackets shows
the (broad) phonetic form.

(27) Italian Intervocalic s-Voicing: applies within ω (/s/ → [z])

a. [[caserm-​a]ω]κ [kazerma] ‘barracks’
b. [[noi-​os-​in-​o]ω]κ [nojozino] ‘somewhat annoying’

(28) Italian Intervocalic s-Voicing: does not apply across ωs within κ (/s/ → [s])
a. [[lo]σ [si]σ [ri]σ [sala]ω]κ [lo si risala] ‘one re-​salts it’
b. [[comprando]ω [se]σ [lo]σ]κ [komprando se lo] ‘buying it for oneself’
c. [[porta]ω [sapone]ω]κ [porta sapone] ‘soap dish’

Note that in (27), in the absence of other elements, the ω is co-​extensive
with the κ. In (28), the grouping of various stray elements (level 2 prefix,
clitics) and compounds in the same type of constituent (κ) correctly predicts
the similarity of their phonological behavior.
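
The domain restriction illustrated in (27) and (28) can be made concrete with a small Python sketch. It is only an illustration under simplifying assumptions that are not part of the analysis itself: forms are given in a broad ASCII transcription, the vowel inventory is reduced to five symbols, and the function names (apply_isv, realize_kappa) and the labels "omega" and "sigma" are hypothetical. The rule inspects material inside a single ω and never looks across the daughters of a κ.

```python
# Sketch only: Intervocalic s-Voicing (ISV) restricted to the phonological word.
# Assumptions: broad ASCII transcription, a five-vowel inventory, and the
# labels "omega" (phonological word) and "sigma" (stray syllable).
VOWELS = set("aeiou")


def apply_isv(form):
    """Voice /s/ to [z] when it stands between two vowels inside one omega."""
    segments = list(form)
    for i in range(1, len(form) - 1):
        if form[i] == "s" and form[i - 1] in VOWELS and form[i + 1] in VOWELS:
            segments[i] = "z"
    return "".join(segments)


def realize_kappa(daughters):
    """Spell out one composite group given as a list of (form, label) pairs.

    ISV applies inside each omega separately; stray syllables pass through
    unchanged, so no /s/ is ever voiced across a daughter boundary.
    """
    return " ".join(
        apply_isv(form) if label == "omega" else form
        for form, label in daughters
    )


print(realize_kappa([("kaserma", "omega")]))
print(realize_kappa([("lo", "sigma"), ("si", "sigma"), ("ri", "sigma"), ("sala", "omega")]))
print(realize_kappa([("porta", "omega"), ("sapone", "omega")]))
```

Under these assumptions the three calls yield kazerma, lo si ri sala, and porta sapone, which reproduces the distribution of [z] and [s] in (27a), (28a), and (28c).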
While in Italian, all suffixes are parsed within the ω with the root, and
most prefixes are not, as in (27) and (28), there are some instances of
“lexicalized” prefixes that do combine with the root, and these cases are sub-
ject to Intervocalic s-​Voicing. In fact, there are a number of minimal pairs
with regard to the nature of the prefix, and ISV application. For example, in
lexicalized items where ri-​has a less transparent meaning, it may be considered
a level 1 affix, and as such, undergo ISV, as in (29). This contrasts with the
more usual status of ri-​ as a productive level 2 affix, equivalent to English
“re-​”, which does not undergo ISV, as in (30).

(29) Italian Intervocalic s-​Voicing: application with level 1 ri-​prefix

level 1 ri-: [[ri-salire]ω]κ [rizalire] ‘to date back to’ (/s/ → [z])

(30) Italian Intervocalic s-​Voicing: no application with level 2 ri-​prefix

level 2 ri-​: [[ri]σ [salire]ω]κ [risalire] ‘to go up again’ (< re-​go up; /​s/​ = [s]‌)

It should be noted that even if ri-​ is not recognized as an affix due to the
lexicalized meaning of (29), ISV is still correctly predicted since ri-​would then
be considered part of the root, and thus automatically part of the ω.
In addition, it can be seen that an Italian phonotactic constraint on the pal-
atal lateral [ʎ] is straightforwardly accounted for by the distinction between
the ω and κ constituents. That is, while [ʎ] is excluded from the onset of a
syllable at the beginning of a ω, it may appear in the onset of a syllable in
other positions within the κ. Thus, “gl” is pronounced as [gl] rather than [ʎ]
ω-​initially in (31), but as [ʎ] in other positions, as in (32).

(31) “gl” = [gl] at left edge of ω

[[glicine]ω]κ [gliʧine] ‘wisteria’

(32) “gl” = [ʎ] elsewhere in κ

[[gli]σ [smalti]ω]κ [ʎi zmalti] ‘the enamels’
[[dando]ω [glie]σ [lo]σ]κ [dando ʎelo] ‘giving it to him’ (< ‘giving (to) him it’)

Note that [ʎ] is also allowed as a syllable onset word internally, where it
may arise as part of a geminate (e.g., figli [[fiʎ.ʎi]ω]κ ‘sons’).
The English voicing assimilation patterns also crucially differ in the ω
and κ constituents, with the former being more permissive than the latter.
For example, as noted previously, we find both assimilated and unassim-
ilated sequences involving /​ s/​within the ω (e.g., [z]‌ : [cleanse]ω, [Mars]ω;
[s]: [fence]ω, [parse]ω); however, beyond that level, only assimilated sequences
may appear, regardless of the nature of –​s. By parsing all of the “stray” –​s
morphemes in the same way within the κ, as shown in (33), the similarity in
their phonological behavior is accounted for, as is their difference from the
ω level behavior.

(33) English voicing assimilation in κ (/s/ → [z])

a. Plural: [[fan]ω s]κ b. Pe3sg: [[fan]ω s]κ
c. Poss: [[Dan]ω s]κ (cactus) d. Copula: [[Dan]ω s]κ (careful)
e. Aux: [[Dan]ω s]κ (coming)

As previously noted, Match Theory does not capture the same
generalizations since it defines ω differently (i.e., comprising all affixes), and
then relegates the other material to different constituents on the basis of their
syntactic status. The Adjunction Approach achieves a result more similar to
that of the present analysis since it distinguishes the internal ω from the others
labeled as ω’, roughly corresponding to the composite group. The Composite
Prosody Model and the Adjunction Approach nevertheless differ crucially in
the nature of their mapping procedures, as well as the relationship between the
ω and the next higher constituent that includes ω along with any additional
material. The difference is seen in (34), where adjunction creates recursive ωs
in a stepwise manner (34a), while the Composite Prosody Model constructs a
distinct, n-​ary branching κ (34b).

(34) Adjunction ω’ vs. κ structures

a. [tree diagram: ω’ built by stepwise adjunction of recursive ωs]   b. [tree diagram: n-ary branching κ]

Although the intermediate ω labels in (34a) would be eliminated, as noted
above (e.g., Itô and Mester 2009a, 2013), it is not clear what the status of
their constituents would be, and thus what the overall structure would be: for
example, a) a hierarchical structure, either with unlabeled internal brackets
(e.g., [σ [σ [σ [ ]ω]]]ω’) or with brackets bearing a plain ω label
(e.g., [σ [σ [σ [ ]ω]ω]ω]ω’), or b) a flat structure more similar to (34b),
where the extra ω’ labels are simply removed (e.g., [σ σ σ [ ]ω]ω’). Regardless
of which option is adopted, there still remain the
problems of the difference in behavior and the relationship between the lower
ω and upper ω’ constituents. Both of these concerns are avoided by explicitly
distinguishing between the ω and the κ constituents.

1.5.2 Composite Prosody Model geometry: General properties

Thus far only a general overview of the Composite Prosody Model of the
prosodic hierarchy has been provided. This includes the presence of the composite
group constituent between the phonological word and the phonological
phrase, and the possibility of skipping levels, but not of recursion, at least
in the constituents below the phonological phrase. In this and the following
sections, the principles underlying the construction and restrictions on the
architecture of the prosodic hierarchy are introduced and discussed.
Considering first the issue of recursion, it was noted that if Recursion
1 (Cn dominates Cn) is permitted along with removing strict dominance,
the only limitation remaining from the SLH is the exclusion of Recursion
2 (Cn dominates Cn+1). If embedded structures of the latter type in syntax
are also mirrored in prosodic structure, it would be necessary to remove
this last exclusion as well. It was amply demonstrated above, however, that
different types of recursive structures introduce a range of problems not only
in accounting for specific phonological phenomena but also with regard to
the overall geometry of the prosodic hierarchy. In excluding the possibility
of recursion, at least below the phonological phrase level, the Composite
Prosody Model offers the potential of a more highly constrained prosodic
hierarchy more generally.
Taking the strongest position, the present proposal excludes all recursion.
Thus, the Principle of Constituent Sequencing in (35) requires that a con-
stituent dominate only constituents lower than itself.

(35) Constituent Sequencing: A prosodic constituent of level Cn may only
dominate constituents of level Cn-1 or lower.

As formulated, (35) applies to all interface and non-interface levels of the
prosodic hierarchy, and thus not only imposes strong limitations on their
geometry but also serves as a unifying property across levels. If it turns out,
however, that the higher-​level constituents do, in fact, exhibit some recur-
sion due to their closer connection to recursive syntactic structures, as pre-
viously mentioned, then (35) may need to be somewhat less restrictive for
those levels.
The Principle of Constituent Sequencing is somewhat similar to the
Principle of Containment proposed by Itô and Mester (2009a: 138) as a gen-
eral property of the prosodic hierarchy: “[e]‌ach immediate dominance rela-
tion respects the containment structure of the prosodic hierarchy, in the sense
that lower-​ranked elements do not immediately dominate higher-​ ranked
elements”. While both principles prevent a constituent from dominating one
that is higher in the hierarchy, or Recursion 2, Containment excludes only this
type of recursion; Constituent Sequencing excludes both Recursion 1 and 2.
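
The difference between the two principles can be stated as a check over trees. The following Python sketch is an illustration only, not a formalization proposed in the literature: the six ASCII level labels, the representation of a tree as a (label, children) pair, the treatment of primed constituents as belonging to the same level as their unprimed counterparts, and all function names are assumptions made for the example. Containment is modeled as tolerating a daughter of the same level (Recursion 1), while Constituent Sequencing rejects it; both reject a daughter of a higher level (Recursion 2).

```python
# Sketch only: a prosodic tree is a (label, children) pair; leaves have [].
# The ASCII labels and the six-level hierarchy are illustrative assumptions.
LEVELS = ["sigma", "foot", "omega", "kappa", "phi", "iota"]
RANK = {label: rank for rank, label in enumerate(LEVELS)}


def violations(tree, allow_same_level):
    """Collect (mother, daughter) pairs that break the dominance restriction."""
    label, children = tree
    found = []
    for child in children:
        child_label = child[0]
        higher = RANK[child_label] > RANK[label]   # Recursion 2
        same = RANK[child_label] == RANK[label]    # Recursion 1
        if higher or (same and not allow_same_level):
            found.append((label, child_label))
        found.extend(violations(child, allow_same_level))
    return found


def obeys_containment(tree):
    """Only Recursion 2 is excluded."""
    return not violations(tree, allow_same_level=True)


def obeys_constituent_sequencing(tree):
    """Both Recursion 1 and Recursion 2 are excluded, as in (35)."""
    return not violations(tree, allow_same_level=False)


adjoined = ("omega", [("sigma", []), ("omega", [])])   # omega over sigma + omega
composite = ("kappa", [("sigma", []), ("omega", [])])  # kappa over sigma + omega
inverted = ("omega", [("phi", [])])                    # omega over phi
```

In this sketch the adjoined structure satisfies Containment but not Constituent Sequencing, the composite structure satisfies both, and the inverted structure satisfies neither.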
With regard to the skipping of levels and the parsing of smaller elements,
it was noted that any model of the prosodic hierarchy with a fixed, universal,
set of constituents does not allow a level to be skipped completely. Thus, as
a minimum requirement on the structure of the prosodic hierarchy, we adopt
a principle that ensures that each non-​terminal constituent (Cn) contains a
head, that is, a constituent of the immediately lower level (Cn-​1), even if there
is substantial overlap. Following Itô and Mester (e.g., 2003: 37), this Principle
of Proper Headedness is formulated as in (36):

(36) Proper Headedness: Every (non-terminal) prosodic constituent of level Cn
must have a head, that is, it must immediately dominate a constituent of
level Cn-1.

It will be recalled that there was a problem in the higher-level recursive
structures where a constituent Cn’ dominated a Cn, but not the next lower constituent,
Cn-1. This is not a problem in the present model, however, if recursion
is excluded at all levels. If the higher φ and ι levels do, in fact, exhibit recur-
sion, an additional provision would be needed for these.
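
Proper Headedness in (36) can likewise be expressed as a simple check. The Python sketch below is illustrative only; the level labels, the (label, children) tree format, and the decision to treat a node with no listed children as unexpanded (and therefore not checked) are assumptions of the example rather than claims of the model.

```python
# Sketch only: same illustrative tree format, (label, children).
LEVELS = ["sigma", "foot", "omega", "kappa", "phi", "iota"]
RANK = {label: rank for rank, label in enumerate(LEVELS)}


def properly_headed(tree):
    """True if every expanded node immediately dominates a node one level down."""
    label, children = tree
    if not children:  # unexpanded node: nothing to check in this sketch
        return True
    has_head = any(RANK[child[0]] == RANK[label] - 1 for child in children)
    return has_head and all(properly_headed(child) for child in children)


composite = ("kappa", [("sigma", []), ("sigma", []), ("omega", [])])  # head: omega
headless = ("kappa", [("sigma", []), ("foot", [])])                   # no omega
skipping = ("omega", [("omega", []), ("sigma", [])])  # C' over C and Cn-2, no Cn-1
```

Here only the first structure is properly headed. The third corresponds to the problematic recursive configuration recalled above, in which a constituent dominates another of its own level and a lower element, but no constituent of the immediately lower level.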
Independently of recursion, in any prosodic hierarchy that permits levels
to be skipped, the question arises as to whether there are restrictions on the
skipped levels. Unlike models that allow stray elements to be parsed at all levels
of the prosodic hierarchy, the present proposal limits the number of levels that
may be skipped by only allowing stray elements to be parsed below the φ.
This is accomplished by the mapping procedures that build the lower-​level
constituents, discussed further below, and by the Principle of Minimal Distance
in (37), which requires that all material be parsed at the lowest level possible.

(37) Minimal Distance: Parse phonological material into the first available
prosodic constituent.

“Availability” here refers to much the same requirement as in Itô and
Mester’s (2003: 38) Maximal Parsing principle, whereby parsing must apply
“within the limits imposed by other (universal and language-​ particular)
constraints on prosodic form”. For example, if a language permits onset
clusters, both consonants of a CCV string would be parsed within a syllable
as long as they respect the Sonority Sequencing Principle (e.g., English [slo]σ, but
*[lso]σ). If a language does not permit clusters, however, the first C could not
be parsed in the syllable at all; if present, it must be considered extrasyllabic,
and most likely parsed at the ω level (e.g., [s[lo]σ]ω).
It should be noted that the Principle of Minimal Distance nevertheless
yields somewhat different results from Itô and Mester’s principle of Maximal
Parsing. Specifically, the latter establishes as much structure as possible,
whereas the present proposal minimizes structure to the extent possible. For
example, while Maximal Parsing would parse a sequence of two stray syllables
(e.g., clitics) first as a foot, and then possibly a ω, before parsing them into a
higher constituent (38a), Minimal Distance would directly parse both syllables
in the first constituent that is available, without creating new ones, that is, in a
κ –​as sisters of a ω (38b).

(38) Parsing stray elements

a. Maximal parsing (Itô and Mester 2003)
    σ σ ω → [[σ σ]Σ ω]ω’ or σ σ ω → [[[σ σ]Σ]ω ω]ω’
b. Minimal distance (see (37))
    σ σ ω → [σ σ ω]κ

In fact, it was seen above in (7) that grouping two stray syllables into a
foot in Italian leads to incorrect predictions, for example, that there should
be prominence on the first syllable. If the foot is then also parsed as a ω, add-
itional incorrect predictions would be made, for example, that the ω domain
rule of Intervocalic s-​Voicing should apply.27
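
The contrast in (38) can be sketched procedurally. The Python fragment below is a rough illustration, not an implementation of either proposal: the function names, the nested-tuple output format, and in particular the left-to-right pairing of stray syllables into feet under maximal parsing are assumptions introduced for the example.

```python
# Sketch only: strays are stray syllables preceding a host phonological word.
def minimal_distance(strays, host):
    """Attach every stray directly as a sister of the omega inside one kappa."""
    return ("kappa", list(strays) + [("omega", host)])


def maximal_parsing(strays, host):
    """Build feet over pairs of strays first, then adjoin them in an omega'.

    Pairing proceeds left to right (an assumption); a leftover single
    syllable is adjoined without a foot.
    """
    grouped = []
    for i in range(0, len(strays), 2):
        pair = list(strays[i:i + 2])
        grouped.append(("foot", pair) if len(pair) == 2 else pair[0])
    return ("omega'", grouped + [("omega", host)])


flat = minimal_distance(["me", "lo"], "porta")     # cf. [σ σ ω]κ in (38b)
layered = maximal_parsing(["me", "lo"], "porta")   # cf. [[σ σ]Σ ω]ω’ in (38a)
```

The flat result contains no constituent over the two stray syllables, so nothing in the structure predicts foot-level prominence on the first of them; the layered result introduces exactly the foot that leads to that prediction.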
In sum, the three principles proposed here, Constituent Sequencing, Proper
Headedness, and Minimal Distance, impose strong limitations on the overall
geometry of the prosodic hierarchy. In particular, they address the two main
problems introduced with the weakening of the SLH, recursion and the
skipping of levels. Moreover, since it is proposed that the principles apply to
the non-​interface constituents as well as to the interface constituents, they
offer a means of unifying the prosodic hierarchy from the smallest to the lar-
gest constituents, as discussed further below.

1.5.3 Composite Prosody Model geometry: Morphological interface properties

While for the present purposes it may be assumed that a direct syntax-​
phonology mapping procedure such as that in Match Theory (Selkirk 2011)
applies to the φ and ι constituents, it has been demonstrated that the same
type of procedure is not tenable for the lower constituents. Thus, alternative
mapping procedures must be provided for these constituents, in particular,
the two interface constituents, the composite group and the phonological
word, since it is assumed that the lower foot and syllable constituents are
constructed in accordance with general phonological principles pertaining to
these structures.
Although the specific word formations and associated phonological phe-
nomena differ across languages, if the prosodic hierarchy is universal, the
mapping procedures that construct the constituents must be general and
applicable to any language. Beginning with the phonological word, it is
proposed that the minimal morphosyntactic requirement is that there be a
morphological root or “core”, as stated in the Principle of the Morphological
Core in (39) (cf. Vogel 2012: 52).

(39) Morphological Core: A phonological word must contain
a morphological root.

This principle places a strong restriction on possible ωs, and thus excludes
certain types of structures that have been deemed ωs in previous analyses. In
particular, it excludes ωs consisting of only an affix or function word, even if it is
phonologically “substantial” (among others, Booij 1985, 1999, 2007; Itô and
Mester 2009a; Vigário 2003; Wiese 1996). It also excludes ωs consisting only of
a combination of affixes and/or function words (e.g., Dixon and Aikhenvald
2002). As seen above, in the Composite Prosody Model, such “stray” elements
are parsed directly in the Composite Group, both avoiding the need to define
the ω differently in different situations and keeping the prosodic structure to
a minimum.
In analyses where more “substantial” functional elements or affixes are
analyzed as phonological words, this is typically done on the basis of prop-
erties that, in fact, coincide with foot properties (e.g., weight or prominence).
While this coincidence is not surprising, since the minimal (phonological)
word is usually coextensive with a foot, relabeling certain feet or combinations
of syllables as ωs essentially undermines the notion of universal prosodic
constituents. Some ωs are defined via mapping rules from morphology, while
others are defined in a language-​specific way dependent on what is deemed
a ω in a given language. In fact, this outcome is similar to the problematic
renaming of various elements as ωs in N&V to satisfy the SLH, although the
items in question did not conform to the more general properties of the ω. In
the present proposal, the items in question only need to constitute feet, and
these in turn are parsed directly in the composite group. They thus exhibit
the necessary weight or other phonological properties, without unneces-
sarily being ascribed morphological attributes associated with the interface
mapping (e.g., Vogel 2009, 2010, 2012).
The morphological core is the basis for the minimal ω; however, other
material is often included as well, specifically any so-​called cohering or level 1
affixes that interact phonologically with their roots. While the classification of
individual affixes as level 1 (or some equivalent) is based on language-​specific,
or even item-​specific, considerations, these details are not what is relevant for
the mapping principles. As proposed in Kabak and Vogel’s (2001) analysis
of Turkish, what is crucial for phonological word mapping is just the (non-​)
cohering status of affixes, regardless of how this has been determined for a
given language. Specifically, the non-​cohering affixes are identified as Prosodic
Word Adjoiners (PWAs), signaling that they attach to a ω, not within a ω. The
relevant information can be encoded in the form of a subcategorization frame,
along with other indications such as what part of speech an affix attaches to,
and whether it attaches to the left or the right of its base.
In Turkish, regular stress assignment applies to the final syllable of a ω.
While this may include many suffixes, given the agglutinating nature of the
language, not all affixes participate in regular stress assignment, and this is
encoded by their PWA status, as illustrated in (40); the stressed syllable is

(40) Turkish phonological words without and with a PWA

(Kabak and Vogel 2001: 327)
a. [[sev il di niz]ω ]κ ‘you were loved’
   love pass past 2pl
b. [[sev il]ω mePWA di niz ]κ ‘you were not loved’
   love pass neg past 2pl

In (40a), in the absence of a PWA, all of the suffixes are included in
the ω, and stress is on the last one, -​niz. In (40b), however, the PWA -​me
subcategorizes for the right edge of a ω to its left, so the final syllable within
that ω (i.e., the one before the PWA) is the one that receives the stress;
the PWA and any following material will then be parsed in the composite group.
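
The effect of a PWA on the ω boundary, and hence on regular stress, can be sketched as follows. This Python fragment is an illustration under simplifying assumptions that go beyond the text: each morpheme is taken to be one syllable (as in (40)), the root is assumed to come first, PWA status is supplied as a Boolean flag, and the function names are hypothetical.

```python
# Sketch only: morphemes are (form, is_pwa) pairs in linear order, root first.
def split_at_first_pwa(morphemes):
    """Return (omega, strays) for a suffixed word.

    The omega ends immediately before the first Prosodic Word Adjoiner;
    the PWA and everything after it are left for the composite group.
    """
    for index, (_, is_pwa) in enumerate(morphemes):
        if is_pwa:
            break
    else:
        index = len(morphemes)
    omega = [form for form, _ in morphemes[:index]]
    strays = [form for form, _ in morphemes[index:]]
    return omega, strays


def regular_stress(morphemes):
    """Regular stress falls on the final syllable of the omega."""
    omega, _ = split_at_first_pwa(morphemes)
    return omega[-1]


loved = [("sev", False), ("il", False), ("di", False), ("niz", False)]
not_loved = [("sev", False), ("il", False), ("me", True), ("di", False), ("niz", False)]
```

With these inputs, regular_stress selects niz for (40a) and il for (40b), and split_at_first_pwa leaves me, di, and niz outside the ω in (40b).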
The PWA specification also allows for a straightforward account of phe-
nomena such as the difference in English between the application of complete
assimilation of /​n/​to a following sonorant with the prefix in-​ but not with
the prefix un-​. The former, a level 1 affix, is part of the ω with the root, while
the latter is a PWA and requires a ω boundary to its right. Thus, assimilation
applies when the nasal and following sonorant are within a ω, but not other-
wise, as shown in (41).

(41) Phonological word structure and English in-​and un-​prefixes

a. [[ir respons ible]ω]κ
b. [unPWA [respons ive]ω]κ

If multiple affixes are PWAs, all that is necessary is that the first PWA
establish the end of the ω constituent; any subsequent PWAs make reference
to this boundary, as illustrated in (42).

(42) English with two PWAs

[[father]ω lessPWA nessPWA]κ

Both -​less and -​ness are PWAs, and once -​less establishes the right edge of
the ω, -​ness recognizes this edge; it does not require another ω edge to its left.
Thus, no additional structure is introduced, and word stress applies within the
ω to the first syllable of father.
Thus far, we have seen how level 2 affixes, which are excluded from the ω,
are parsed in the κ constituent; however, it was seen above that other types of
stray elements that do not constitute ωs (i.e., clitics and other types of function
words) are similarly parsed. Typically, these elements interact phonologically
with the item to their left or right depending on which is more closely related
morphosyntactically, as illustrated in (43).

(43) Italian clitic parsing

a. [me lo [porta]ω]κ (< mi lo porta)
meCL itCL brings ‘(he) brings it (to) me’
b. [[porta]ω me lo]κ (< porta mi lo)
bring   meCL itCL ‘bring it (to) me!’
c. [[porta]ω mi]κ [lo [specchio]ω]κ (< porta mi lo specchio)
bring   meCL  the mirror ‘bring me the mirror!’

In (43a) and (43b), the clitics mi and lo are parsed to the left or the right of
the verb according to their syntactic structures, and in both cases, /​i/​changes
to [e]‌since mi is followed by another clitic in the same κ. The change does not
occur, however, in (43c), where the clitics are parsed in separate κs, following
their syntactic structure (e.g., Vogel 2009). Directional Clitics are parsed to
the left or right regardless of their syntactic position, and interact with the
material in the κ they form part of.
Since the Principle of Minimal Distance parses all stray elements,
including DCLs, at the κ level, the correct generalization is made with
regard to the similarity of their behavior. That is, elements that are parsed
in the same way exhibit the same phonological patterns, regardless of
their position in syntactic structure. Thus, as illustrated in (44), the –​s of
the English auxiliary and copula has the same phonological status as the
plural and third person singular suffixes and the possessive, despite the fact
that the auxiliary and copula are syntactically more closely related to the
material to the right.

(44) Uniform parsing of stray elements

a. Plural, Pe3sg, Possessive: [(the/​he) [fan]ω s]κ
b. Auxiliary: The cat [in the [barn]ω s]κ found a mouse.
c. Copula: The cat [in the [barn]ω s]κ female.

In (44b, c), the -​s is pronounced as [z]‌, assimilating to the voiced segment to
the left, just like the items in (44a); it does not assimilate to the voicelessness
of the /​f/​of the more closely related word to the right.
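
The uniformity claimed for (44) amounts to saying that the realization of –s is a function of the final segment of the host ω alone. The Python sketch below illustrates this with the familiar three-way pattern ([əz] after sibilants, [s] after other voiceless segments, [z] elsewhere); the segment sets are deliberately incomplete, and the function name and category labels are assumptions of the example.

```python
# Sketch only: the segment inventories are partial and purely illustrative.
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
VOICELESS = {"p", "t", "k", "f", "θ"}


def realize_s(host_final):
    """Surface form of a stray -s, given only the last segment of its host."""
    if host_final in SIBILANTS:
        return "əz"
    if host_final in VOICELESS:
        return "s"
    return "z"


# The morphosyntactic category is not an argument of realize_s, and neither
# is the segment to the right of -s, so all five categories pattern alike.
CATEGORIES = ["plural", "3sg", "possessive", "copula", "auxiliary"]
after_n = {category: realize_s("n") for category in CATEGORIES}
```

Because neither the category nor the following word enters the computation, the –s after barn in (44b, c) comes out as [z] in the same way as the suffixes in (44a).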
Differently from the CG in N&V, the composite group also includes the
members of compounds, and thus the κ mapping procedure must parse
together the multiple ωs of compounds, but not those of phrases.28 This is
accomplished by the Principle of the Morphological Maximum, which imposes
a maximal limit of one lexical word (LW) per κ, as stated in (45), capturing the
generalization that the combined members of compounds constitute a single
lexical item.

(45) Morphological maximum: A composite group maximally contains one
lexical word.

Since the individual members of a compound typically constitute lexical
words when standing alone, only the highest or most inclusive LW is the one
that is crucial for κ construction.
While some other proposals treat compounds as a type of (recursive)
phonological word, the composite group makes a systematic distinction
between the individual members of compounds (ωs) and entire compounds
(κs). This distinction provides a straightforward account of a broad range of
phonological phenomena, illustrated in (46), with examples examined in pre-
vious sections.

(46) Phonological word and composite group phenomena

a) English word and compound stress: [[políce]ω [acádemy]ω]κ
b) Italian Intervocalic s-​Voicing: [[po[z]‌a]ω [sigarette]ω]κ ‘ash tray’
(< place cigarettes)
c) Hungarian vowel harmony: [[fekete]ω [doboz]ω]κ ‘black box’
(front /​back vowels)

In (46a), lexical stress is assigned in different ways to the individual ωs,
but to the first member of compounds (i.e., κs), indicated by acute accents
and bolding, respectively. In (46b), Italian Intervocalic s-​Voicing is observed
in the individual members of compounds, but not across the members of a
compound. Similarly, in (46c), Hungarian vowel harmony applies within each
member of a compound, but not across its members.
The κ constituent also provides a straightforward account of differences
observed between compounds and other structures that would appear similar
to compounds if “substantial” function words or other stray elements are
analyzed as ωs. That is, by parsing the stray elements as feet as opposed to ωs,
since they lack a “morphological core” or root, we can distinguish between
actual compounds, for example, [[canteen]ω [racks]ω]κ, where κ level prominence
(i.e., compound stress) falls on the first ω, canteen, and structures with
a “substantial” function word such as [[between]Σ [tracks]ω]κ, where κ level
prominence falls on the first and only ω, tracks.
In sum, three general geometry principles (Constituent
Sequencing, Proper Headedness, Minimal Distance) and two morphological(-syntactic)
mapping principles (Morphological Core, Morphological
Maximum) together establish stringent restrictions on the structure of the prosodic
hierarchy, the former applying at all levels, and the latter specifically at the
interface levels below the φ. Thus, they do not permit excessive types and
numbers of phonological structures. In addition, since they provide for the
inclusion of a constituent between the phonological word and the phono-
logical phrase, the composite group, they account for systematic differences
in behavior between ωs and larger κ structures that also include various stray
elements. The κ constituent, moreover, yields the correct generalizations
regarding the similarities in phonological behavior of the range of elements
that it comprises.

1.5.4 Test cases: More complex structures

To further assess the composite group constituent, as well as the principles
that construct it and regulate the overall geometry of the prosodic hierarchy
within the Composite Prosody Model, we consider additional, more complex,
data in this section.
For example, the Italian items in (47) include multiple suffixes and multiple
types of stray elements (i.e., level 2 prefix, clitics, substantial function
words), and as can be seen, the ω and κ structures account for the pertinent
phonological phenomena (i.e., Intervocalic s-​Voicing, clitic /​i/​change to [e]‌,
trisyllabic (stress) window) as effectively as they did in the simpler cases in the
previous sections.

(47) Expanded Italian structures: affixes, clitics and function words29

a. [[non]Σ [lo]σ [si]σ [ri]σ [selezion-​e-​rébbe]ω ]κ
  not itCL oneCL   re-​   would select, pe3sg   ‘one would not re-​select it’
b. [[porta]ω [me]σ [lo]σ ]κ [[la]σ [settimana]ω ]κ (prossima)
   bring   meCL itCL   the   week       (next)
‘bring it (to) me (next) week!’
c. [[porta]ω [mi]σ ]κ [[la]σ [sua]Σ [seta]ω ]κ
   bring meCL   the   his   silk      ‘bring me his silk!’

In (47a), all of the suffixes are included in the ω, and in accordance
with the trisyllabic window, stress appears on the penultimate syllable, even
though it is part of a suffix. If stress remained on the root (selezióna), it would
instead fall four syllables from the end. In addition, the two instances of /​s/,
shown in bold, fail to undergo ISV since their intervocalic contexts are not
contained within a ω, regardless of the number and nature of the stray elem-
ents involved. Note that the function words non ‘not’, and sua ‘his’ in (47c),
are analyzed as feet; the former contains a coda and thus can be considered
heavy, and the latter contains two syllables. In (47b), the rule changing /i/ to
[e] in clitic sequences applies to mi (→ me), but it is appropriately blocked in
(47c), where the sequence is not within the same κ. Additionally, in (47c), as
in (47a) and (47b), ISV does not apply since the intervocalic contexts in which
/​s/​appears are not contained within a ω.
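The domain restriction on Intervocalic s-Voicing described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a κ is a flat list of (form, label) pairs and that orthographic strings stand in for segments; the function names are hypothetical and are not part of the Composite Prosody Model itself.

```python
# Illustrative sketch only: orthography stands in for phonological segments.
VOWELS = set("aeiouàèéìòóù")

def isv_within_omega(form):
    """Voice s to z only when both neighbours inside this one string are vowels."""
    out = []
    for i, ch in enumerate(form):
        prev_is_vowel = i > 0 and form[i - 1] in VOWELS
        next_is_vowel = i + 1 < len(form) and form[i + 1] in VOWELS
        out.append("z" if ch == "s" and prev_is_vowel and next_is_vowel else ch)
    return "".join(out)

def apply_isv(kappa):
    """Apply voicing inside each ω; stray elements and ω boundaries block it."""
    return [(isv_within_omega(form) if label == "ω" else form, label)
            for form, label in kappa]

# (46b): voicing inside the first ω, none across the compound boundary
ash_tray = apply_isv([("posa", "ω"), ("sigarette", "ω")])

# (47a): neither /s/ is intervocalic inside a ω, so nothing changes
reselect = apply_isv([("non", "Σ"), ("lo", "σ"), ("si", "σ"),
                      ("ri", "σ"), ("selezionerebbe", "ω")])
```

Because each ω is processed in isolation, the vowel-s-vowel sequences that straddle a boundary (as in ri + selezionerebbe) are never visible to the rule, which is the behavior the flat κ structure is meant to capture.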
English has both level 1 and level 2 prefixes and suffixes, and permits strings
of cliticized elements and other function words, thus providing the oppor-
tunity to test additional types of complex ω and κ structures. In the structure
whether he re-illegalizes them in (48), only the level 1 prefix il-​is parsed in the
ω; all the other elements are parsed in the κ (i.e., level 2 affixes and substan-
tial affixes and function words). The pronouns he and them are shown in their
reduced (cliticized) forms [i]‌and [əm], respectively.

(48) Complex English structure: affixes, clitics and function words

[[whether]Σ [i]‌σ [re]σ [il legal]ω [ize]Σ s [əm]σ]κ

Both the suffix -​ize and the function word whether are phonologically sub-
stantial, and while they resemble ωs (e.g., lexical items eyes and weather), they
are analyzed as feet here. This allows them to exhibit the relevant phono-
logical properties, including prominence on the first syllable of whether, while
reserving ω status for lexical items, which contain a morphological core.
The reduced forms of the pronouns provide additional insight into the
ω and κ, and demonstrate that they not only succeed in accounting for the
distribution of the pronouns, but also crucially, they allow us to capture the
necessary generalizations, which are missed otherwise. Since [əm] is phrase
final, in any approach, it would (trivially) attach or cliticize leftward. The
[i]‌form is more problematic, however, since he is syntactically closer to the
material on its right, but at first glance, it seems to be phonologically dependent
on the word whether on its left. In fact, if whether is not present, the full form
[hi] must be used (e.g., with the verb go for simplicity: [wɛðərigoz] but [higoz]).
The Minimal Distance principle requires that whether and he, as well as
the other stray elements in (48), be parsed at the first available prosodic level,
κ, and within this constituent, the structure is flat, so that [i]‌is not phono-
logically related more closely to either whether or the following verb. Closer
examination reveals that the relationship between the presence of whether and
the use of the full form [hi] is not, in fact, a result of the phonological depend-
ence of he on whether, but rather simply a matter of the structure of the κ
itself. That is, if he is initial in the κ, [hi] must be used; if whether –​or other
material –​is present, the reduced form [i] may be used (e.g., also until [i] goes,
before [i] goes). In fact, the material preceding he need not be “substantial”,
indicating further that the appearance of the clitic form [i] is not phonolo-
gically dependent on a stronger element to its left (e.g., if [i] goes). Since the
reduced form [i] is not obligatory, and the full form [hi] is also possible (e.g.,
[wɛðərhi] goes), the choice appears to be a stylistic one that is independent of
the prosodic structure per se. The crucial prosodic information thus remains
only whether or not he is at the left edge of κ.
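The generalization just stated can be expressed as a very small sketch. It assumes that a κ is represented as a flat list of orthographic elements; the function name is hypothetical, and the sketch encodes only the left-edge condition, not the stylistic choice between the two forms.

```python
def forms_of_he(kappa, position):
    """Return the pronunciations of 'he' permitted at `position` in a flat κ."""
    if kappa[position] != "he":
        raise ValueError("the element at this position is not 'he'")
    # κ-initial: only the full form; elsewhere the reduced form is also possible.
    return ["hi"] if position == 0 else ["i", "hi"]

initial = forms_of_he(["he", "goes"], 0)
after_whether = forms_of_he(["whether", "he", "goes"], 1)
after_if = forms_of_he(["if", "he", "goes"], 1)
```

Note that the function never inspects what precedes he, only whether anything does, which mirrors the claim that the reduced form does not depend on a stronger element to its left.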
Other types of complexity can also be observed with compounds. In
languages like English, compounds may be very long; however, this in itself
is not problematic. If all of the components of a compound are ωs, they are
simply accommodated as such in a κ corresponding to the full lexical word
(e.g., [[ski]ω [jacket]ω [zipper]ω [factory]ω]κ), and the Compound Stress Rule
applies regularly to enhance the first member of the κ. The situation becomes
potentially more problematic when various types of stray elements (level 2
affixes, clitics, other function words) are interspersed with the ωs, as illustrated
in (49).

(49) More complex English compounds30.

a. [[writ]ω er s [cramp]ω s]κ
b. [[happi]ω ness re [assessment]ω [train]ω ing]κ

In such flat κ structures, the question is how to ensure the correct appli-
cation of any relevant phonological phenomena, for example, the voicing
assimilation of the /​s/​in (49a) to [z]‌after the /​r/​of writer, and not to [s] before
the /​k/​ of cramp. In fact, no additional information is necessary. The sub-
categorization frame that indicates the PWA status of -​s also encodes its
direction of attachment as a suffix (i.e., to the right of a ω); similarly, the sub-
categorization frame for the plural -​s attaches it as a suffix following cramp,
yielding [s] after the voiceless /​p/​. By the same token, the subcategorization
frames associated with the various affixes in (49b) account for their direction
of phonological interaction. In both cases, the Compound Stress Rule applies
to enhance the first member of the compound (i.e., the first ω of the com-
posite group).
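The leftward orientation of -s within a flat κ can be sketched as follows. The representation (a list of (form, label) pairs), the function name, and the small set of voiceless finals are assumptions made for illustration; orthography stands in for segments, and the [ɪz] alternant after sibilants is deliberately ignored.

```python
# Simplified orthographic stand-in for word-final voiceless segments (assumption).
VOICELESS_FINALS = set("ptkf")

def realize_s(kappa):
    """Realize each stray -s by assimilating it to the element on its left."""
    out = []
    for i, (form, label) in enumerate(kappa):
        if label == "stray" and form == "s" and i > 0:
            left_final = kappa[i - 1][0][-1]   # last symbol of the left neighbour
            form = "s" if left_final in VOICELESS_FINALS else "z"
        out.append((form, label))
    return out

# (49a): the first -s follows writer, the second follows cramp
writers_cramps = realize_s([("writ", "ω"), ("er", "stray"), ("s", "stray"),
                            ("cramp", "ω"), ("s", "stray")])
```

The element to the right of each -s is never consulted, so the /k/ of cramp has no effect on the first -s even though the two are adjacent in the flat structure.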
Finally, the parsing of stray elements directly in the composite group
avoids what have been considered “ordering” or “bracketing paradoxes” in
other models. That is, the κ’s relatively flat structure does not encode infor-
mation corresponding to the order of morpheme attachment, and thus it
does not present the opportunity for paradoxes to arise. The only type of
morphological information that is required is whether an element is a level 1
or a level 2 affix, the latter indicated by its PWA subcategorization property.
For example, in (51) and (52), it can be seen that the order of attachment of
the affixes, indicated by the level subscripts 1 and 2, is not reflected in the
corresponding κs.

(51) Ordering paradox: Interspersing of Level 1 and Level 2 affixes

a. Morphological Structure: [[[un2 [[grammat]N ical1]Adj]Adj ity1]N s2]Npl
b. Prosodic Structure: [un [grammat icál ity]ω s]κ

(52) Ordering paradox: Different morphological structures with the same elements
a. Morphological Structure 1: [un [[lock]V able]Adj]Adj (= cannot be locked)
b. Morphological Structure 2: [[un [lock]V]V able]Adj (= can be unlocked)
c. Prosodic Structure: [un [lock]ω able]κ

In (51b), the fact that un-​is not parsed in the ω depends only on its PWA
status. Since neither -​ical nor -​ity is a PWA, they both form part of the ω and
participate in its stress assignment, even if un-​ is morphologically attached
between the two. The PWA status of the plural -​s allows it to be parsed dir-
ectly in the κ, where it observes the necessary voicing assimilation pattern. The
insensitivity of the phonology to the order of morpheme attachment is seen
further in (52), where words with different internal morphological structures,
and corresponding meanings, are prosodically structured, and pronounced,
in the same way (52c).
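The insensitivity of the prosodic parse to the order of morpheme attachment can be illustrated with a short sketch. It assumes that morphological structure is given as nested lists whose leaves are (form, kind) tuples, with kind being "root", "level1", or "pwa"; these labels and the function names are hypothetical conveniences rather than notation from the model.

```python
def leaves(tree):
    """Yield (form, kind) leaves left to right, discarding all nesting."""
    if isinstance(tree, tuple):
        yield tree
    else:
        for subtree in tree:
            yield from leaves(subtree)

def prosodify(tree):
    """Build a flat κ: root and level 1 affixes form the ω, PWAs stay outside it."""
    kappa, omega = [], []
    for form, kind in leaves(tree):
        if kind in ("root", "level1"):
            omega.append(form)
        else:
            if omega:                      # close the ω before a stray element
                kappa.append(("ω", tuple(omega)))
                omega = []
            kappa.append(("stray", form))
    if omega:
        kappa.append(("ω", tuple(omega)))
    return kappa

# (52a) and (52b): different bracketings of un-lock-able
cannot_be_locked = [("un", "pwa"), [[("lock", "root")], ("able", "pwa")]]
can_be_unlocked = [[("un", "pwa"), [("lock", "root")]], ("able", "pwa")]

# (51a): un-grammat-ical-ity-s
ungrammaticalities = [[[("un", "pwa"), [[("grammat", "root")], ("ical", "level1")]],
                       ("ity", "level1")], ("s", "pwa")]
```

Since only the linear order and the level 1 versus PWA status of each element are consulted, the two structures in (52) necessarily receive the same parse, and no bracketing paradox can be stated.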
Compounds also frequently result in ordering paradoxes, for example,
when inflections apply (morphosyntactically) to an entire compound, but
interact phonologically only with the adjacent element. As seen in the English
example in (53a), although the plural –​s pertains to the entire compound, it
is pronounced as [z]‌due to its assimilation to the directly preceding voiced
segment within the composite group. Similarly, in languages with vowel
harmony, although an inflection may pertain to an entire compound, it
participates in the harmony of the linearly adjacent material. Thus, in the
Hungarian example in (53b), while the first member of the compound has
front vowels, the rest has back vowels, including the two suffixes, which har-
monize with the directly preceding (back) root.

(53) Inflection of compounds

a. English –​s assimilation
  Morphological structure: [[[tennis]N [team]N]N s]Npl
  Prosodic structure: [[tennis]ω [team]ω s]κ   (–​s = [z]‌)
b. Hungarian vowel harmony
  Morphological structure: [[[[élet]N [tartam]N]N ok]N ban]N
   life  span   pl  inessive    ‘in lifespans’
  Prosodic structure: [[élet]ω [tartam ok ban]ω]κ

In sum, the foregoing examples demonstrate that the ω and κ constituents,
as defined within the context of the Composite Prosody Model, account for
a range of additional and more complex types of data than those discussed
in previous sections, including phenomena that are problematic in the other
models. Since the ω and κ include only the minimum necessary prosodic struc-
ture, they are relatively flat, and often quite distinct from the corresponding
morphosyntactic structures. Moreover, the explicit distinction between the
ω and κ constituents predicts both the necessary similarities and differences
among various types of elements (e.g., affixes, clitics, other function words)
that are missed when prosodic structure (at least that below the φ) is more
isomorphic to morphosyntactic structure. Thus, the assessment of both the
constituents and the restrictions on the geometry of the prosodic hierarchy
provided by the Composite Prosody Model reveals considerable success in
achieving the basic goals of linguistic theory: insightfully accounting for
attested phenomena, while excluding others that are not attested, or expected
to be attested.

1.6 Discussion: Prosodic structure geometry or geometries?

While both Match Theory and the Composite Prosody Model are intended to
provide models of the overall prosodic hierarchy, in fact, the former focuses
on phenomena operating in relation to phrasal (syntactic) structures, whereas
the latter focuses on phenomena operating in relation to morphological
structures, and various functional (syntactic) elements.31 It is thus perhaps
not surprising that the two approaches to the prosodic hierarchy appear to be
rather divergent. We must thus ask whether a single, “one size fits all” model
of the prosodic hierarchy can be maintained, or is even desirable. And if not,
what type of relationship exists between the different components of the pros-
odic hierarchy?

1.6.1 Prosodic constituents and types of phonological phenomena

Along with the difference between the nature of their interfaces, primarily
with morphology for the prosodic constituents below the phonological
phrase, and syntax for the higher constituents, it has been seen in previous
sections that the two types of prosodic constituents require different types
of mapping procedures. While the former must be built up from smaller to
larger elements, the latter are established on the basis of fully formed syn-
tactic structures. Moreover, the relational mapping of the lower prosodic
constituents advanced in the Composite Prosody Model excludes recursive
structures, while these may be permitted in the latter, paralleling the recursive
structures in syntax. Indeed, it appears that the higher constituents may cor-
respondingly exhibit more repetitive phenomena, for example multiple con-
stituent edge markings (e.g., Penultimate Vowel Lengthening in Xitsonga at
the right edge of ι and ι’ (Selkirk 2011)), and tonal contours spreading across
repetitions of φ and ι structures (e.g., Ladd 1986, 1996; Itô and Mester 2012;
Selkirk 2011 among others).
The difference between the syntactic interface of the higher constituents
and the morphological interface of the lower constituents is also reflected
in the presence of exceptions. While the phonological phenomena applying
in the former appear to be fully regular, the phenomena of the latter may
be more limited and exhibit idiosyncrasies or exceptions. It was seen, for
example, that in English /​n/​completely assimilates to a following /​l/​or /​r/​only
with the in-​prefix within the ω constituent, and in Italian, the rule changing
/​i/​to [e]‌applies only in certain sequences of clitics, in the κ constituent.32
Additionally, within the ω, there may be “disharmonic” patterns in vowel harmony
languages, and more idiosyncratic patterns such as the different ways the
final /​d/​in a word such as divide surfaces when followed by different level 1
suffixes (i.e., [s]: divis-​ive; [z]: divis-​ible; [ʒ]: divis-​ion).
Finally, a rather different type of property can also be seen to distinguish
the upper and lower portions of the prosodic hierarchy, the potential effect
of extragrammatical phenomena. At the higher levels, considerations such
as speech rate and the size or weight of constituents may override the basic
prosodic constituent mapping rules, and consequently alter the domains of
application of their phonological phenomena. Indeed, Selkirk (2011) points
out that this is fairly characteristic of the ι, and not uncommon in the φ (e.g.,
Italian Raddoppiamento Sintattico (N&V), Lekeitio Basque tonal patterns
(Elordieta 1997, 2007)). The same flexibility is not, however, characteristic
of the lower ω and κ constituents, and in fact, different applications of their
phonological phenomena would most likely signal some sort of error, not
simply an alternate phrasing option.
In sum, it is clear that there are multiple fundamental differences between
the prosodic constituents below the phonological phrase and the higher
constituents. The question is whether such differences warrant essentially
two distinct prosodic hierarchies, or whether there is some way to retain a
single prosodic hierarchy. In either case, the problem that must be addressed
if we make a distinction between the “bottom up” and “top down” mapping
procedures of the different types of constituents, is how to transition from
one to the other, a challenge that has been present in prosodic phonology
from the outset.

1.6.2 Unified prosodic structure geometry/​geometries

Given that the lower and upper prosodic constituents must ultimately connect
to each other, it is not clear what advantage would derive from establishing
two distinct prosodic hierarchies. Moreover, as was seen in Section 1.5.2,
there are fundamental properties (Proper Headedness, Minimal Distance, and
Constituent Sequencing) that apply to all constituents, including the non-​
interface constituents, something that would not be expected if the various
constituents did not form part of a single, unified prosodic hierarchy.
The same general principles can also be seen to apply at the interfaces
between the groups of prosodic constituents. That is, Proper Headedness
requires that the phonological phrase contain a constituent of the next lower
level as its head, here the composite group, and similarly that the phonological
word contain a foot as its head. Minimal Distance connects stray segments,
syllables, and feet to the first available (interface) constituent, the ω, and if that
is not available, to the next constituent, the κ. By the same token, κs must be
parsed at the φ level, rather than some higher level, even if they contain clitics
or other function words that would syntactically be associated with higher,
or different, structures. Finally, with regard to Constituent Sequencing, the
lower constituent groupings are nested within the higher ones: non-​interface
constituents < lower (morphology) interface constituents < higher (syntax)
interface constituents.
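The joint effect of the three principles can be sketched as a small well-formedness check. The tree encoding ((label, children) with unanalyzed leaves given as strings), the numeric ranks, the "seg" label for a bare segment, and the function name are all assumptions made for illustration. In particular, Minimal Distance is reduced here to one of its consequences, namely that no stray element is parsed at or above the φ, and Constituent Sequencing is read simply as the requirement that a child be lower than its parent.

```python
# Hypothetical ranking of the levels discussed in the text ("seg" = bare segment).
RANK = {"seg": 0, "σ": 1, "Σ": 2, "ω": 3, "κ": 4, "φ": 5, "ι": 6}

def violations(node):
    """Return a list of violations for a tree of (label, children-or-string) nodes."""
    label, content = node
    if isinstance(content, str):           # unanalyzed leaf: nothing to check
        return []
    problems = []
    child_ranks = [RANK[child[0]] for child in content]
    if any(rank >= RANK[label] for rank in child_ranks):
        problems.append(f"{label}: Constituent Sequencing (child not lower than parent)")
    if RANK[label] - 1 not in child_ranks:
        problems.append(f"{label}: Proper Headedness (no head of the next lower level)")
    if RANK[label] >= RANK["φ"] and any(rank < RANK[label] - 1 for rank in child_ranks):
        problems.append(f"{label}: stray element at or above φ")
    for child in content:
        problems.extend(violations(child))
    return problems

# (48): a flat κ with one ω head and several stray elements
example_48 = ("κ", [("Σ", "whether"), ("σ", "i"), ("σ", "re"), ("ω", "il legal"),
                    ("Σ", "ize"), ("seg", "s"), ("σ", "əm")])

stray_in_phi = ("φ", [("σ", "x"), ("κ", [("ω", "tracks")])])
recursive_omega = ("ω", [("ω", "a"), ("Σ", "b")])
headless_kappa = ("κ", [("Σ", "between")])
```

Under these assumptions the structure in (48) passes, while a syllable attached directly to the φ, a recursive ω, and a κ without an ω head are each rejected for a different reason.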
The Composite Prosody Model provides a more nuanced theory of the
prosodic hierarchy that allows us to accommodate both the fundamental
differences between the two types of interface constituents and the
non-interface constituents, as well as the general principles that apply across all
of the constituents. That is, while it recognizes an internal tripartite division
among the prosodic constituents based on the nature of the interface with
other components of grammar, it unifies all of the levels by imposing the
same restrictions on the overall architecture of the prosodic hierarchy, as
shown schematically in Figure 1.2.
As can be seen in Figure 1.2, the composite group plays a crucial role in
the prosodic hierarchy, serving as the conjunction of the morphological and
syntactic interfaces, where the transition takes place between the building
up of prosodic constituents from smaller to larger via a relational mapping
procedure, and a more direct mapping from syntactic structures to prosodic
constituents.

Figure 1.2 Composite Prosody Model with tripartite prosodic hierarchy
[Levels, top to bottom. Syntax interface (mapping from syntax): Intonational Phrase, Phonological Phrase. Morphology interface (mapping from small elements to larger): Composite Group, Phonological Word. No interface: Foot, Syllable.]

Figure 1.3 Transition between upper and lower interface constituents in prosodic hierarchy
[Match Clause: Intonational Phrase. Match Phrase: Phonological Phrase. Composite Group => Phonological Words, Stray Elements.]

The stitching together of the different portions of the prosodic hierarchy
is accomplished as phonological phrases determine which lexical items they
must include, but not the prosodic structures associated with these items.
While the lexical items comprise one or more phonological words, the ωs may
only include a portion of a lexical item, the root and level 1 affixes, but not
level 2 affixes. In order for a φ to include full lexical items (with level 2 affixes),
it must be composed of the prosodic constituents that incorporate all of the
affixes, composite groups. Since the κs also parse stray functional elements, all
of the material in a given κ will be parsed in the corresponding φ. Where there
are “rough edges”, cases where bits of the lower prosodic constituents do not
align with the domains delimited by the syntax, it is the lower constituents
that prevail. That is, clitics and other functional elements come along in the
composite groups that have been built up by the relational mapping pro-
cedure, even if they are not consistent with the syntactic parsing, as in the case
of directional clitics. The transition between the lower and upper portions of
the prosodic hierarchy is illustrated schematically in Figure 1.3.
In Figure 1.3, an intonational phrase dominates two phonological phrases,
φ1 and φ2, as determined by a syntactic mapping procedure. Each φ includes
at least one lexical item, which consists of a ω and any associated material,
grouped into a κ. The shaded “x” (e.g., a directional clitic) is syntactically
part of the phrase corresponding to φ2, but since it does not interact phonologically
with this phrase, but rather with the element to its left, it forms part
of the composite group with this element, κ2. Thus, the relational parsing
of material up to clitics and other functional elements accounts for their
prosodic placement, and consequently their prosodic phonological behavior.
Without the composite group mediating between the phonological word
and the phonological phrase, it is not possible to account for the fact that
phonological words do not necessarily include all of their affixes, and that
not all functional elements interact phonologically with their closest syntactic
element. In both cases, the necessary prosodic “allegiance” of the material is
determined by its presence in the relevant composite group.

1.7 Conclusions
This chapter has considered a number of modifications of the original model
of prosodic phonology, as articulated in Nespor and Vogel (1986). Consistent
across the proposals and theoretical perspectives is the recognition that the
SLH, and specifically the principle of strict dominance, was too restrictive
and thus needed to be weakened. Since relaxing the restrictions on any type of
system, by definition, gives rise to previously excluded options, it creates new
challenges of determining whether all of the additionally permitted options
are desirable, and if not, what other types of restrictions must be instituted to
appropriately constrain the system.
As was demonstrated, weakening the SLH results in considerable
overgeneration of prosodic structure configurations. Aside from the sheer
number of possibilities, which is at least intuitively implausible, the add-
itional options result in incorrect predictions about the types of phonological
patterns that will be observed in languages, as well as the loss of generalizations
among the phenomena within a given language. Three types of approaches to
counteract the problems that arise from weakening the SLH have thus been
examined: Selkirk’s (2011) Match Theory, Itô and Mester’s (e.g., 2009a, b)
Adjunction Approach, and the Composite Prosody Model advanced here.
It was demonstrated that while Match Theory highly restricts the mapping
relations between the morphosyntax and phonology, returning to a model in
which prosodic constituents closely mirror syntactic structures, it also permits
a vast increase in the internal configurations of the prosodic constituents. In
particular, it allows stray elements to be parsed at any level of the prosodic
hierarchy, and it includes recursive constituents at all three of the levels it
recognizes: phonological word, phonological phrase, and intonational phrase.
Aside from the large number of additional prosodic configurations, the recur-
sive constituents were also shown to introduce a number of problems with
regard to the nature and definition of both recursion and the constituents
themselves. The reliance on syntax in constructing the phonological word,
moreover, was shown to obscure the well-​established phonological differences
between level 1 and level 2 affixes.
The Adjunction Approach adopts the same approach as Match Theory
for the mapping of the phonological phrase and intonational phrase; however,
differently from Match Theory, it substantially limits the possible
prosodic configurations by restricting the appearance of stray elements to
below the phonological phrase. It also crucially differs from Match Theory
in distinguishing between level 1 and level 2 affixes. The inclusion of recur-
sive constituents, however, introduces the same types of problems with the
definitions of the prosodic constituents and recursion that arise in Match
Theory.
Differently from both Match Theory and the Adjunction Approach, the
Composite Prosody Model advanced here crucially includes a prosodic con-
stituent between the phonological word and the phonological phrase, the
composite group. As was demonstrated, this constituent not only provides
the necessary domain to straightforwardly account for a range of phenomena
across languages that are problematic in the other models, but it also allows
us to avoid recursion, at least below the phonological phrase, and thus the
various drawbacks that accompany recursive constituents.
In parsing together a number of different stray elements (i.e., level 2
affixes, clitics, other function words), as well as compounds, the composite
group correctly predicts similarities in their phonological behavior, as dis-
tinct from those of phonological words, on the one hand, and phonological
phrases, on the other hand. It was shown, furthermore, that parsing the
stray elements in the composite group also effectively limits the possible
prosodic configurations, since the elements in question may not appear else-
where in prosodic structures. This, in turn, substantially limits the range of
phonological structures and phenomena predicted to be possible in human
language.
Examination of a number of fundamental distinctions between the prosodic
constituents that interface primarily with morphology and those that
interface with syntax at first glance appeared to suggest that there may, in
effect, be different prosodic hierarchies for the two types of constituents.
While the syntax-​interface constituents seem to closely mirror the structures
from which they are mapped, possibly including recursion, the morphology-​
interface constituents may diverge substantially from the corresponding
morphological (and syntactic) structures from which they are derived, via a
relational mapping. The phonological phenomena that apply in the former,
moreover, appear to be exceptionless, while the phenomena associated with
the latter often exhibit limitations and idiosyncrasies. The former also appear
to be subject to extragrammatical considerations such as their weight or size
and rate of speech, while the latter are not.
While the simplest model of phonological interfaces would certainly be
one in which the same mapping principles apply consistently at all levels,
the fact that there are different constellations of phenomena in different
portions of the prosodic hierarchy indicates that such simplicity is not ten-
able. The Composite Prosody Model offers a more nuanced view of the pros-
odic hierarchy that features a tripartite structure, with distinct properties
associated with each of the different types of constituents, those interfacing
primarily with syntax or with morphology, and those that do not interface
with other components of grammar. It additionally provides a means of
unifying the various prosodic levels via a small set of principles that restrict
the overall architecture of the hierarchy. Specifically, Proper Headedness,
Minimal Distance, and Constituent Sequencing strictly limit possible prosodic
structures, and their related phenomena: they allow levels to be skipped in
the prosodic hierarchy but, at the same time, restrict the occurrence of
stray elements to below the phonological phrase. Moreover, recursion is in
principle excluded, although the option is left open that it may be required in
the higher syntax-​interface constituents.
Finally, it was demonstrated that the Composite Prosody Model offers
a means of addressing the long-​standing problem of bridging between the
lower constituents that cannot be mapped directly from morphosyntactic
structures, and the higher constituents that rely more directly on syntactic
structures. Crucially, the composite group serves as a type of transition con-
stituent, being built up from smaller to larger elements, but also interfacing
with the syntactic structures by defining the phonological domains that are
associated with, but not necessarily identical to, the lexical items that are
parsed within phonological phrase constituents.
Of course, the bottom line is the correct and insightful account of attested
phonological phenomena, as well as the determination of what types of
phonological structures and phenomena are expected to be attested, and
which should be excluded. In comparison with other prosodic models, it has
been demonstrated that the Composite Prosody Model offers substantial
advantages in both regards.

1 I am grateful for the discussion and comments I received on an earlier version
of this chapter from the participants at the First International Conference on
Prosodic Studies: Challenges and Prospects (Tianjin, China; June 2015). Of
course, all shortcomings are my own.
2 The body of research on the Prosodic Hierarchy is by now quite vast. It is not the
intention to provide a review of this body of literature here, but only to highlight
some core issues and representative works as they pertain to the questions under
investigation. Other recent publications offer detailed background, summaries,
and analyses of various aspects of Prosodic Phonology. For a particularly thor-
ough discussion, see Scheer (2010).
3 Some analyses have also argued for direct reference to syntactic constituents
(among others, Cinque 1993; Kaisse 1985; Odden 1987, 1996, 2000); however,
given the existence of numerous phenomena that clearly do not apply in syntactic
domains, a model that uniquely relies on syntax cannot be adequate. Selkirk’s
(2011) recent Match Theory returns to a more direct reliance on syntax, but
nevertheless leaves some room for differences between syntactic and phonological
structures, as will be discussed below.
4 The abbreviation for the Clitic Group was just “C” in N&V. The mora was not
included in N&V, but it has subsequently been included in some hierarchies since
it consists of structure beyond a single segment and participates in prosodic
phonological phenomena (e.g., stress assignment, tonal patterns).
5 The listing of variations here is only meant to be illustrative. For recent summaries
and discussions of the developments in Prosodic Phonology, the reader is referred
to Scheer (2010), Dehé et al. (2011), and Selkirk (2011), among others.
6 Note, however, that the original Phonological Utterance could include more than
one sentence (e.g., N&V; Vogel 1986), although this is not possible in a model such
as Match Theory.
7 Throughout this chapter, phonological phenomena are usually referred to as rules,
and derivational-​type formulations are used to represent them. This is done as
a matter of expediency since such formulations tend to be descriptively simple
and clear; it is not intended as an argument for this type of approach over some
other type.
8 For simplicity, the Clitic Group is omitted here and elsewhere unless it is crucial
for a given discussion. It is not, however, the intention to ultimately exclude such
a constituent from the Prosodic Hierarchy, as will be seen below.
9 Such an argument is of course not relevant to approaches that do not assume a
universal set of constituents (e.g., Schiering et al. 2010).
10 The form si has several functions in Italian, so there could be more than one trans-
lation for examples using this element here and below. In each case, one possible
translation is provided.
11 The rule is stated informally here, but in fact, it also applies in the presence of
glides (i.e., [-​cons] segments).
12 This independence is recognized in constraint-​based analyses that include sep-
arate constraints and rankings pertaining to skipping levels and recursion (among
others, Itô and Mester 1992, and other publications; Selkirk 1996, and other
publications; Truckenbrodt 1999).
13 The parsing of /​s/​directly into the ω, furthermore, allows it to remain available
for syllabification as the coda of a preceding word as needed (e.g., following a
stressed vowel as in tre sfide [trés.fí.de] ‘three challenges’). (See among others Vogel
1977, 1982.)
14 The Xitsonga examples presented in Selkirk (2011), and discussed elsewhere in this
chapter, are based on material derived from Kisseberth’s (1994) original analysis
of the language.
15 See also van der Hulst (2010) for a somewhat different view of syllable-​internal
recursion, as well as a general discussion of recursion at different phonological
16 The prime diacritic is used here to show mora recursion, although in the literature
it is less commonly used for moras than for other recursive constituents.
17 This refers to so-​called “non-​cohering” affixes; differences between affix types are
discussed in more detail below.
18 Note that the enhancement is primarily perceptual, the effect being caused by a
reduction of the prominence of the other elements. There are also some different
stress patterns in compounds (e.g., Plag et al. 2008), but this does not alter the
main point here.
19 The treatment of marginal or more limited phenomena is interesting in its own
right; see among others Simon and Wiese (2011) and Inkelas (2014).
20 There may also be cases where a higher constituent directly dominates a mora.

Life after the Strict Layer Hypothesis 55

21 Note that since Match Theory does not distinguish between the two types of
affixes (i.e., ω corresponds to a lexical item regardless of its internal structure), it
is not subject to the problems discussed here.
22 A handful of Italian verbs deviate from this pattern, exhibiting pre-​antepenultimate
stress in certain third person plural forms (e.g., teléfonano ‘(they) telephone’);
however, these are rare and considered to be exceptional.
23 Some Italian dialects differ with respect to the role of clitics in Stress Assignment
(e.g., Peperkamp 1997); however, the patterns do not completely replicate those seen
within the internal ω and thus must be accounted for differently (e.g., Vogel 2009).
24 “dim” = diminutive suffix.
25 See Itô and Mester (2007) for an earlier proposal to reduce the prosodic hierarchy
to three interface categories: intonation group, phrase, word.
26 “ʊ” and “κ” represent the Phonological Utterance and the Composite Group,
27 Since Itô and Mester (e.g., 2003, 2009a) provide cases in which Maximal Parsing
yields correct results, it remains to be determined whether the same results can
be obtained with the simpler Principle of Minimal Distance, in conjunction with
the other aspects of the Composite Prosody Model. If not, adjustments must be
made, or different constraint rankings might be invoked for the different cases.
28 Vigário (2011) includes a similar constituent, which she calls the “Prosodic Word
Group”, in her analysis of Portuguese compounds. This closely parallels the
Composite Group analysis presented in Vogel (e.g., 2009, 2010); however, unlike
the κ, the Prosodic Word Group does not also include stray elements.
29 The function word in (47c), sua ‘his’ has two syllables and thus also constitutes a
Foot. Although the structure in (47a) is somewhat contrived, it was found to be
acceptable by two native speakers of Italian.
30 For simplicity, the prosodic constituents of the stray elements (syllables, feet) are
not shown.
31 The Adjunction Approach focuses on the same types of phenomena as the
Composite Prosody Model, but as discussed in previous sections, it encounters
problems associated with its use of recursive structures.
32 In fact, this rule is even more selective and may not apply, for example, with the
clitic ci ‘there’, as in ci si compra (not *ce si compra) ‘one buys there’.

Anderson, S. (2005) Aspects of the theory of clitics. Oxford: Oxford University Press.
Anttila, A. (2002) “Morphologically conditioned phonological alternations”, NLLT,
20, pp. 1–​42.
Basbøl, H. (1975) “Grammatical boundaries in phonology”, Aripuc, 9, pp. 109–​135.
Basbøl, H. (1981) “On the function of boundaries in phonological rules” in Goyvaerts,
D. (ed.) Phonology in the 1980’s. Ghent: Story-​Scientia, pp. 245–​269.
Beckman, M. E., and Ayers, G. M. (1994) Guidelines for ToBI labelling. Online MS
and accompanying files. www.ling.ohio-​​phonetics/​E_​ToBI
Beckman, M. E., and Hirschberg, J. (1994) The ToBI annotation conventions. Online
MS. www.ling.ohio-​​~tobi/​ame_​tobi/​annotation_​conventions.html
Beckman, M., and Pierrehumbert, J. (1986) “Intonational structure in English and
Japanese”, Phonology Yearbook, 3, pp. 255–​310.

Bertinetto, P. M. (1999) “Boundary strength and linguistic ecology”, Folia Linguistica,
33, pp. 267–​286.
Bickel, B., Hildebrandt, K., and Schiering, R. (2009) “The distribution of phono-
logical word domains” in Grijzenhout, J., and Kabak, B. (eds.) Phonological
domains: Universals and deviations. Berlin: Mouton de Gruyter, pp. 47–​75.
Booij, G. (1985) “Coordination reduction in complex words: A case for prosodic
phonology” in Hulst, H. van der, and Smith, N. (eds.) Advances in non-​linear phon-
ology. Dordrecht: Foris, pp. 143–​160.
Booij, G. (1996) “Cliticization as prosodic integration: The case of Dutch”, Linguistic
Review, 13, pp. 219–​242.
Booij, G. (1999) “The role of the prosodic word in phonotactic generalizations” in
Hall, T. A., and Kleinhenz, U. (eds.) Studies on the phonological word. Philadelphia,
PA: John Benjamins, pp. 47–​72.
Booij, G. (2007 [2005]) The grammar of words. Oxford: Oxford University Press.
Carter, R. T. Jr. (1974) Teton Dakota phonology. Ph.D. Diss., University of New
Mexico. (Published as University of Manitoba Anthropology Papers 10.)
Chomsky, N., and Halle, M. (1968) Sound pattern of English. Cambridge, MA:
MIT Press.
Cinque, G. (1993) “A null theory of phrase and compound stress”, Linguistic Inquiry,
24, pp. 239–​297.
Czaykowska-​Higgins, E., and Kinkade, M. D. (eds.) (1998) Salish languages and lin-
guistics: Theoretical and descriptive perspectives. Berlin: Mouton De Gruyter.
Dehé, N., Feldhausen, I., and Ishihara, S. (2011) “The prosody–​syntax interface: Focus,
phrasing, language evolution”, Lingua, 121(13), pp. 163–​169.
Dixon, R. M. W., and Aikhenvald, A. Y. (2002) “Word: A typological framework” in
Dixon, R. M.W., and Aikhenvald, A. Y. (eds.) Word. Cambridge: University Press,
pp. 1–​41.
Downing, L. J. (1999) “Prosodic stem ≠ prosodic word in Bantu” in Hall, T. A., and
Kleinhenz, U. (eds.) Studies on the phonological word. Philadelphia, PA: John
Benjamins, pp. 73–​98.
Elfner, E. (2012) Syntax-Prosody Interactions in Irish. PhD Dissertation. University of
Elordieta, G. (1997) “Accent, tone and intonation in Lekeitio Basque” in Martínez-Gil,
F., and Morales-Front, A. (eds.) Issues in the phonology and morphology
of the major Iberian languages. Washington, DC: Georgetown University Press,
pp. 4–​78.
Elordieta, G. (2007) “Minimum size constraints on intermediate phrases” in
Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken,
pp. 1021–​1024.
Gussenhoven, C. (2004) The phonology of tone and intonation. Cambridge: Cambridge
University Press.
Gussenhoven, C. (2005) “Procliticized phonological phrases in English: Evidence from
rhythm”, Studia Linguistica, 59, pp. 174–​193.
Haider, H. (1993) Deutsche syntax, generativ. Tübingen: Gunter Narr.
Hall, T. A. (1999) “The phonological word: a review” in Hall, T. A., and Kleinhenz,
U. (eds.) Studies on the phonological word. Philadelphia, PA: John Benjamins,
pp. 1–​22.
Hayes, B. (1989) “The prosodic hierarchy in meter” in Kiparsky, P., and Youmans, G.
(eds.) Rhythm and meter. Orlando, FL: Academic Press, pp. 201–​260.

Hayes, B. (1995) Metrical stress theory: Principles and case studies. Chicago: University
of Chicago Press.
Hulst, H. van der (2010) “A note on recursion in phonology” in Hulst, H. van der (ed.)
Recursion and human language. Berlin: Mouton de Gruyter, pp. 301–​342.
Inkelas, S. (2014) The interplay of morphology and phonology. Oxford: Oxford
University Press.
Inkelas, S., and Orgun, C. O. (1998) “Level (non)ordering in recursive morph-
ology: Evidence from Turkish” in Lapointe, S. G., Brentari, D. K., and Farrell, P.
M. (eds.) Morphology and its relation to phonology and syntax. Stanford, CA: CSLI,
pp. 360–​410.
Inkelas, S., and Zoll, C. (2005) Reduplication: Doubling in morphology. Cambridge:
Cambridge University Press.
Itô, J., and Mester, A. (1992) “Weak layering and word binarity”. MS, University of
California Santa Cruz, Linguistic Research Center.
Itô, J., and Mester, A. (2003) “Weak layering and word binarity” in Honma, C,
Okazaki, M., Tabata, T., and Tanaka, S. (eds.) A new century of phonology and
phonological theory: A festschrift for Professor Shosuke Haraguchi on the occasion
of his sixtieth birthday. Tokyo: Kaitakusha, pp. 26–​65.
Itô, J., and Mester, A. (2007) “Prosodic adjunction in Japanese compounds” in
Miyamoto, Y., and Ochi, M. (eds.) Formal approaches to Japanese Linguistics:
Proceedings of FAJL 4. Cambridge, MA: MIT Department of Linguistics and
Philosophy, pp. 97–111.
Itô, J., and Mester, A. (2009a) “The extended prosodic word” in Grijzenhout, J., and
Kabak, B. (eds.) Phonological domains: Universals and deviations. Berlin: Mouton
de Gruyter, pp. 135–​194.
Itô, J., and Mester, A. (2009b) “The onset of the prosodic word” in Parker, S. (ed.)
Phonological argumentation: Essays on evidence and motivation. London: Equinox,
pp. 227–​260.
Itô, J., and Mester, A. (2012) “Recursive prosodic phrasing in Japanese” in Borowsky,
T., Kawahara, S., Shinya, T., and Sugahara, M. (eds.) Prosody matters: Essays in
honor of Elisabeth Selkirk. London: Equinox, pp. 280–​303.
Itô, J., and Mester, A. (2013) “Prosodic sub-​categories in Japanese”, Lingua, 124,
pp. 2–​40.
Jones, P. (2011) “New evidence for a phonological stem domain in Kinande”,
Proceedings of WCCFL, 28, pp. 285–​293.
Jun, S.-​A. (1998) “The accentual phrase in the Korean prosodic hierarchy”, Phonology,
15, pp. 189–​226.
Jun, S.-​A. (2005a) “Prosodic typology” in Jun, S.-​A. (ed.) Prosodic typology: The phon-
ology of intonation and phrasing. Oxford: Oxford University Press, pp. 430–​458.
Jun, S.-​A. (ed.) (2005b) Prosodic typology: The phonology of intonation and phrasing.
Oxford: Oxford University Press.
Jun, S.-​A. (ed.) (2014) Prosodic typology II: The phonology of intonation and phrasing.
Oxford: Oxford University Press.
Kabak, B., and Schiering, R. (2006) “The phonology and morphology of function
word contractions in German”, Journal of Comparative Germanic Linguistics, 9,
pp. 53–​99.
Kabak, B., and Vogel, I. (2001) “Stress in Turkish”, Phonology, 18(3), pp. 315–360.
Kahn, D. (1976) Syllable-​based generalizations in English phonology. Ph.D. Diss.,
Massachusetts Institute of Technology.

Kaisse, E. (1985) Connected speech: The interaction of syntax and phonology. New York:
Academic Press.
Kaisse, E. M., and Shaw, P. (1985) “On the theory of lexical phonology”, Phonology
Yearbook, 2, pp. 1–​30.
Kanerva, J. (1990) “Focusing on phonological phrases in Chichewa” in Inkelas, S., and
Zec, D. (eds.) The phonology-​syntax connection. Chicago: University of Chicago
Press, pp. 145–​161.
Kisseberth, C. (1994) “On domains” in Cole, J., and Kisseberth, C. (eds.) Perspectives
in phonology. Stanford, CA: CSLI, pp. 133–​166.
Klavans, J. (1982) Some problems in a theory of clitics. Ph.D. Diss., University College
Klavans, J. (1985) “The independence of syntax and phonology in cliticization”,
Language, 61, pp. 95–​120.
Ladd, D. R. (1986) “Intonational phrasing: The case for recursive prosodic structure”,
Phonology, 3, pp. 311–​340.
Ladd, D. R. (1996/​2008) Intonational phonology. Cambridge Studies in Linguistics 79.
Cambridge: Cambridge University Press.
Loporcaro, M. (1999) “Teoria fonologica e ricerca empirica sull’italiano e i suoi
dialetti. Fonologia e morfologia dell’italiano e dei dialetti d’Italia” in Benincà, P.,
Mioni, A., and Vanelli, L. (eds.) Atti del 31º Congresso della Società di Linguistica
Italiana. Roma: Bulzoni, pp. 117–​151.
McCarthy, J. J. (1979) “On stress and syllabification”, Linguistic Inquiry, 10, pp.
Nespor, M., and Vogel, I. (1982) “Prosodic domains of external sandhi rules” in Hulst,
H. van der, and Smith, N. (eds.) The structure of phonological representations.
Dordrecht: Foris, pp. 224–​255.
Nespor, M., and Vogel, I. (1986/​2007) Prosodic phonology. Dordrecht: Foris.
Odden, D. (1987) “Kimatuumbi phrasal phonology”, Phonology Yearbook, 4,
pp. 13–​36.
Odden, D. (1996) The phonology and morphology of Kimatuumbi. The Phonology of
the World’s Languages. Oxford: Clarendon Press.
Odden, D. (2000) “The phrasal tonology of Zinza”, Journal of African Languages and
Linguistics, 21, pp. 45–​75.
Patterson, T. A. (1990) Theoretical aspects of Dakota morphology and phonology.
Ph.D. Diss., University of Illinois Urbana-​Champaign.
Peperkamp, S. (1997) Prosodic words. HIL Dissertation Series 34. The Hague: Holland
Academic Graphics.
Plag, I., Kunter, G., Lappe, S., and Braun, M. (2008) “The role of semantics, argument
structure, and lexicalization in compound stress assignment in English”, Language
84, pp. 760–​794.
Scheer, T. (2010) A guide to morphosyntax-​phonology theories: How extra-​phonological
information is treated in phonology since Trubetzkoy’s Grenzsignale. Berlin: De
Gruyter Mouton.
Schiering, R., Bickel, B., and Hildebrandt, K. (2010) “The prosodic word is not uni-
versal, but emergent”, Journal of Linguistics, 46(3), pp. 657–​709.
Schiering, R., Hildebrandt, K., and Bickel, B. (2007) “Cross-​linguistic challenges
for the prosodic hierarchy: Evidence from word domains”. MS, University of
Selkirk, E. (1972) The phrase phonology of English and French. Outstanding
Dissertations in Linguistics. New York: Garland Publishing.

Selkirk, E. (1978/1981a) “On prosodic structure and its relation to syntactic
structure” in Fretheim, T. (ed.) Nordic prosody II: Papers from a symposium.
Trondheim: TAPIR, pp. 111–​140.
Selkirk, E. (1980a) “Prosodic domains in phonology: Sanskrit revisited” in Aronoff,
M., and Kean, M. (eds.) Juncture. Saratoga: Anma Libri, pp. 107–​129.
Selkirk, E. (1980b) “The role of prosodic categories in English word stress”, Linguistic
Inquiry, 11, pp. 563–​605.
Selkirk, E. (1984) Phonology and syntax: The relation between sound and structure.
Cambridge, MA: MIT Press.
Selkirk, E. (1986) “On derived domains in sentence phonology”, Phonology Yearbook,
3, pp. 371–​405.
Selkirk, E. (1995) “Sentence prosody: Intonation, stress and phrasing” in Goldsmith,
J. A. (ed.) The handbook of phonological theory. Cambridge, MA: Blackwell,
pp. 550–​569.
Selkirk, E. (1996) “The prosodic structure of function words” in Morgan, J. L., and
Demuth, K. (eds.) Signal to syntax: Prosodic bootstrapping from speech to grammar
in early acquisition. Mahwah, NJ: Lawrence Erlbaum Associates, pp. 187–​214.
Selkirk, E. (2011) “The phonology-​syntax interface” in Goldsmith, J., Riggle, J.,
and Yu, A. (eds.) The handbook of phonological theory, 2nd edition. Oxford:
Blackwell, pp. 435–485.
Selkirk, E., and Tateishi, K. (1988) “Constraints on minor phrase formation in
Japanese” in Larson, M. G. and Brentari, D. (eds.) Proceedings of the 24th Annual
Meeting of the Chicago Linguistics Society. Chicago: Chicago Linguistics Society,
pp. 316–​336.
Selkirk, E., and Tateishi, K. (1991) “Syntax and downstep in Japanese” in Georgopoulos,
C. and Ishihara, R. (eds.) Interdisciplinary approaches to language: Essays in honor
of S.-​Y. Kuroda, Dordrecht: Kluwer, pp. 519–​543.
Selkirk, E., Shinya, T., and Sugahara, M. (2003) “Degree of initial lowering in
Japanese as a reflex of prosodic structure organization” in Proceedings of the 15th
International Congress of Phonetic Sciences. Barcelona, pp. 491–494.
Shaw, P. (1980) Theoretical issues in Dakota phonology and morphology. New York:
Garland Press.
Shaw, P. (1985) “Modularization and substantive constraints in Dakota lexical phon-
ology”, Phonology Yearbook, 2, pp. 173–​202.
Shinya, T., Selkirk, E., and Kawahara, S. (2004) “Rhythmic boost and recursive minor
phrase in Japanese” in Proceedings of the Second International Conference on
Speech Prosody. Nara, Japan, pp. 183–​186.
Simon, H. J., and Wiese, H. (eds.) (2011) Expecting the unexpected: Exceptions in
grammar. Trends in Linguistics: Studies and Monographs Series. Amsterdam: Mouton
de Gruyter.
Truckenbrodt, H. (1999) “On the relation between syntactic phrases and phonological
phrases”, Linguistic Inquiry, 30, pp. 219–​256.
Venditti, J. (2005) “The JToBI model of Japanese intonation” in Jun, S.-​A. (ed.) Prosodic
typology: The phonology of intonation and phrasing. Oxford, New York: Oxford
University Press, pp. 172–​200.
Vigário, M. (2003) The prosodic word in European Portuguese. Berlin: Mouton de
Vigário, M. (2011) “Prosodic structure between the prosodic word and the phono-
logical phrase: Recursive nodes or an independent domain?”, Linguistic Review,
27(4), pp. 485–​530.

Vogel, I. (1977) The syllable in phonological theory: With special reference to Italian.
Ph.D. Diss., Stanford University.
Vogel, I. (1982) La Sillaba come Unità Fonologica. [The syllable as phonological unit].
Bologna: Zanichelli.
Vogel, I. (1986) “External sandhi rules operating between sentences” in Andersen, H.
(ed.) Sandhi phenomena in the languages of Europe. Berlin: Mouton de Gruyter,
pp. 55–​64.
Vogel, I. (1999) “Subminimal constituents in prosodic phonology” in Hannahs, S. J.,
and Davenport, M. (eds.) Phonological structure. Dordrecht: Foris, pp. 251–​269.
Vogel, I. (2008a) “The morphology-​phonology interface: Isolating to polysynthetic
languages”, Acta Linguistica Hungarica, Special issue, 55(1), pp. 1–​22.
Vogel, I. (2008b) “Universals of prosodic structure” in Scalise, S., Magni, E., Vineis,
E., and Bisetto, A. (eds.) Universals of language today. Amsterdam: Springer,
pp. 59–​82.
Vogel, I. (2009) “The status of the Clitic Group” in Grijzenhout, J., and Kabak, B.
(eds.) Phonological domains: Universals and deviations. Berlin: Mouton de Gruyter,
pp. 15–​46.
Vogel, I. (2010) “The phonology of compounding” in Scalise, S., and Vogel, I. (eds.)
Compounding: Theory and analysis. Amsterdam: John Benjamins, pp. 145–​163.
Vogel, I. (2012) “Recursion in phonology?” in Botma, B., and Noske, R. (eds.)
Phonological explorations: Empirical, theoretical and diachronic issues. Berlin/​
Boston: De Gruyter, pp. 41–​61.
Vogel, I., and Raimy, E. (2002) “The acquisition of compound vs. phrasal stress in
English”, Journal of Child Language, 29(2), pp. 225–​250.
Watson, J. C. E. (2002) The phonology and morphology of Arabic. Oxford: Oxford
University Press.
Watson, J. C. E. (2011) “Word stress in Arabic” in van Oostendorp, M., Ewen, C.
J., Hume, E. V., and Rice, K. (eds.) Blackwell companion to phonology, vol.
5. Oxford: Wiley-​Blackwell, pp. 2990–​3019.
Wiese, R. (1996) The phonology of German. Oxford: Clarendon Press.
Zwicky, A. (1984) “Clitics and particles”, Ohio State Working Papers in Linguistics,
29, pp. 148–​173.

The Revised Max Onset
Syllabification and stress in English
San Duanmu

2.1 Syllabification and syllable weight

A typical syllable contains a main vowel, or the nucleus. The part before
the nucleus is the onset and the part after the nucleus is the coda. The part
consisting of the nucleus and the coda is also called the rime, and the part
consisting of the onset and the nucleus is called the ‘body’ (Vennemann 1988).
The terms are illustrated in (1).

(1) Onset, nucleus, coda, body, and rime of a syllable

Word            Onset   Vowel   Coda    Body    Rime
[prɪnt] print   [pr]    [ɪ]     [nt]    [prɪ]   [ɪnt]
[sɪt] sit       [s]     [ɪ]     [t]     [sɪ]    [ɪt]
[ɪt] it         none    [ɪ]     [t]     [ɪ]     [ɪt]
[ðə] the        [ð]     [ə]     none    [ðə]    [ə]
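
The decomposition in (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vowel inventory `VOWELS` and the function name `parts` are hypothetical, and each syllable is assumed to contain exactly one vowel written as a single character.

```python
# Hypothetical vowel inventory, just large enough for the four words in (1).
VOWELS = {"ɪ", "ə"}

def parts(syllable):
    """Split a syllable (a list of segments) into the parts named in (1)."""
    # Position of the nucleus: the first (and only) vowel in the syllable.
    n = next(i for i, seg in enumerate(syllable) if seg in VOWELS)
    onset, nucleus, coda = syllable[:n], syllable[n:n + 1], syllable[n + 1:]
    return {
        "onset": onset,
        "nucleus": nucleus,
        "coda": coda,
        "body": onset + nucleus,  # onset plus nucleus
        "rime": nucleus + coda,   # nucleus plus coda
    }

for word in ("prɪnt", "sɪt", "ɪt", "ðə"):
    row = {name: "".join(segs) or "none" for name, segs in parts(list(word)).items()}
    print(word, row)
```

Each printed row corresponds to one row of (1), with "none" marking an empty onset or coda.
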

Syllabification is a procedure that groups the sounds (consonants and
vowels) of a word into syllables. There are different theories of syllabification.
Consider the English word extra, which can be syllabified in different ways,
shown in (2), where brackets represent syllable boundaries.

(2) Different ways to syllabify extra /​ɛkstrə/​

   Syllabification   Proponent           Assumptions
a. [ɛkstr][ə]        None                Possible onset
b. [ɛkst][rə]        Hoard (1971)        Max Stressed Onset, Max Coda
c. [ɛks][trə]        Lowenstamm (1981)   Max Onset, Sonority
d. [ɛk][strə]        Pulgram (1970)      Max Onset
e. [ɛ][kstrə]        None                Possible coda

It is generally agreed that every syllable should have a possible onset and a
possible coda, to be specified shortly. Thus, no analysis proposes (2a), because
[kstr] is not a possible coda. Similarly, no analysis proposes (2e), because [kstr]
is not a possible onset. But opinions differ on how to create possible onsets
and codas, as seen in (2b)–(2d).
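
As a rough illustration of this filtering step, the sketch below enumerates every division of the intervocalic cluster of extra and keeps those whose coda and onset are both possible. The two inventories are hypothetical and contain only what this one word needs; they are not a description of English.

```python
# Hypothetical inventories, just large enough for the cluster /kstr/ in extra.
# The empty string stands for "no onset" or "no coda".
POSSIBLE_ONSETS = {"", "r", "tr", "str"}
POSSIBLE_CODAS = {"", "k", "ks", "kst"}

def candidate_splits(cluster):
    """Yield every (coda, onset) division of an intervocalic cluster."""
    for i in range(len(cluster), -1, -1):
        yield cluster[:i], cluster[i:]

def admissible(cluster):
    """Keep the divisions whose coda and onset are both possible."""
    return [(coda, onset) for coda, onset in candidate_splits(cluster)
            if coda in POSSIBLE_CODAS and onset in POSSIBLE_ONSETS]

for coda, onset in candidate_splits("kstr"):
    ok = coda in POSSIBLE_CODAS and onset in POSSIBLE_ONSETS
    print(f"[ɛ{coda}][{onset}ə]", "possible" if ok else "ruled out")
```

The candidates are generated in the order (2a) to (2e); only the three middle ones survive, which is where the proposals diverge.
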
Analysis (2b) is proposed by Hoard (1971), based on two assumed
requirements: (i) the onset of a stressed syllable should be maximized (Max
Stressed Onset) and (ii) the coda should be maximized (Max Coda). In extra,
the second syllable has no stress, which means it need not maximize its onset,
and so the first syllable takes all the consonants it can as its coda, leaving only
/r/ to the second syllable. A similar analysis is proposed by Bailey (1978) and
Wells (1990).
Analysis (2c) is proposed by Lowenstamm (1981), who assumes that the
onset should be maximized for all syllables (Max Onset), plus the require-
ment that consonants in the onset should have increasing sonority. Following
Jespersen (1904), Lowenstamm assumes the sonority scale ‘vowel > glide >
sonorant > fricative > stop’, where a vowel has the greatest sonority and a
stop has the least. According to the scale, the sequence /st/ does not have
increasing sonority; therefore, /st/ cannot fit into an onset but must split
between two syllables, as shown.
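
The sonority condition can be sketched as follows. This is a minimal illustration, not Lowenstamm's own formulation: the numeric ranks, the assignment of segments to classes, and the function names are assumptions, and the sketch checks sonority only (it does not test whether a rising sequence is otherwise a possible English onset).

```python
# Ranks follow the scale quoted above (stop < fricative < sonorant < glide).
# The segment-to-class assignments are illustrative and cover few consonants.
SONORITY = {}
for rank, segments in enumerate(["ptkbdɡ", "fvszðθ", "mnlr", "jw"], start=1):
    for seg in segments:
        SONORITY[seg] = rank

def rising(onset):
    """True if sonority strictly increases through the onset."""
    ranks = [SONORITY[seg] for seg in onset]
    return all(a < b for a, b in zip(ranks, ranks[1:]))

def split_cluster(cluster):
    """Return (coda, onset) with the longest onset that rises in sonority."""
    for i in range(len(cluster) + 1):
        if rising(cluster[i:]):
            return cluster[:i], cluster[i:]

print(split_cluster("kstr"))  # extra   -> ('ks', 'tr')
print(split_cluster("sk"))    # whiskey -> ('s', 'k')
print(split_cluster("br"))    # Debra   -> ('', 'br')
```

An empty or single-consonant onset counts as rising, so the loop always returns a division.
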
(2d) is proposed by Pulgram (1970), who also assumes Max Onset for all
syllables but without the sonority requirement. Thus, the onset of the second
syllable is [str].
Let us consider another example. The English word whiskey /wɪski/ has four
proposed analyses, shown in (3). In (3c), [s] is ‘ambisyllabic’, which means it
belongs to both the first syllable and the second, so that the first syllable is
[wɪs] and the second is [ski].

(3) Four ways to syllabify whiskey /​wɪski/​

   Analysis     Proponent                   Assumptions
a. [wɪ][ski]    Halle and Vergnaud (1987)   Max Onset
b. [wɪsk][i]    Hoard (1971)                Max Stressed Onset, Max Coda
c. [wɪ[s]ki]    Kahn (1976)                 Max Onset, ambisyllabic rule
d. [wɪs][ki]    Pulgram (1970)              Max Onset, possible rime

Analysis (3a) is proposed by Halle and Vergnaud (1987), based on Max
Onset. Analysis (3b) is proposed by Hoard (1971), based on Max Stressed
Onset and Max Coda, as discussed above. Analysis (3c) is proposed by Kahn
(1976), based on Max Onset first, followed by an ‘ambisyllabic’ rule that
allows a stressed vowel to use the following consonant as its coda, even if
the consonant is already in the onset of the following syllable. Analysis (3d)
is proposed by Pulgram (1970), based on two requirements: (i) Max Onset,
discussed above, and (ii) possible rime. Because [ɪ] is not a possible rime (no
word in American English ends in /​ɪ/​), the first syllable cannot be [wɪ] but must
be [wɪs].
The analyses in (3) can be achieved in other ways, too. For example, Prince
and Smolensky (1993) obtain (3a) by the requirements Onset (syllables must
have an onset) and No Coda (syllables must have no coda). Hammond (1999)
obtains (3b) by the requirement Max Coda when there are two (or more)
consonants between vowels. Lowenstamm (1981) obtains (3d) by Max Onset
and a sonority requirement, as discussed above, according to which /​sk/​
cannot fit in an onset but must split between two syllables.
Many studies have attempted to determine syllable boundaries through
experiments. However, native intuition does not always offer clear answers.
There are cases where agreement is easy to obtain. For example, all native
speakers reject [ɛkstr][ə] and [ɛ][kstrə] for extra, and all accept [æt][ləs] for
atlas, [bə][ɡɪn] for begin, and [hou][tɛl] for hotel. However, native agreement is
hard to obtain on words like whiskey, city, and many others, although there
is some preference for […VC][V…] over […V][CV…] if the first vowel is short
and stressed (Treiman and Danis 1988; Krakow 1989; Treiman and Zukowski
1990; Turk 1994; Kessler and Treiman 1997; Krakow 1999; Eddington
et al. 2013). Therefore, most proposals on syllabification rely on theoretical
assumptions, in particular how onsets and codas should be formed.
Let us take a close look at what a possible syllable is. A common view is that
a syllable is possible if (i) its initial sequence can be found at the beginning of
a word, and (ii) its final sequence can be found at the end of a word (Pulgram
1970). Let us follow Vennemann (1988) and use the terms the Law of Initials
and the Law of Finals, rephrased in (4) and (5), to define the common view.

(4) The Law of Initials (LOI)

The initial sound sequence of a syllable (i.e., the body) ought to be found
in the initial sound sequence of a word.

(5) The Law of Finals (LOF)

The final sound sequence of a syllable (i.e., the rime) ought to be found in
the final sound sequence of a word.

Several comments are in order. First, the LOI applies to the ‘body’ of a syl-
lable, which includes the main vowel. This way the LOI can rule out syllables
like [sfæt] and [sfɛn] correctly, because no word starts with [sfæ] or [sfɛ]. If
the LOI only applies to the onset, then [sfæt] and [sfɛn] would satisfy the LOI
(contrary to the judgment of native intuition), because the onset [sf] is found
in sphere. Second, the LOF applies to the rime of the syllable, which includes
the main vowel. This way the LOF can rule out a syllable like [kæ], because
no word ends in the rime [æ]. If the LOF only applies to the coda, then a
syllable like [kæ] would satisfy the LOF, because it simply lacks a coda, and
many words end with no coda. Third, the LOI and the LOF apply to the surface
form of a word. For example, the surface form of Canada is [kænədə].
If we syllabify it as [kæn][ə][də], then both the LOI and the LOF are satis-
fied. However, if the LOF applies to the underlying form of Canada, which
according to Chomsky and Halle (1968) is [kænædə], where the first two
vowels are both [æ], then [kæn][æ][də] would violate the LOF, because the
second syllable ends in [æ], yet no English word does.
To illustrate the application of the LOI and the LOF, consider various
ways to syllabify the word extra, analyzed in (6). When the LOI or the LOF
is violated, an asterisk is shown. When the LOI or the LOF is satisfied, a
check mark is shown, and a sample word is given in parentheses, with relevant
sounds underlined.

(6) LOI and LOF in the syllabification of extra /ɛkstrə/

   Syllabification   LOF         LOI
a. [ɛkstr][ə]        *           ✓ (about)
b. [ɛkst][rə]        ✓ (text)    ✓ (repeat)
c. [ɛks][trə]        ✓ (index)   ✓ (tradition)
d. [ɛk][strə]        ✓ (deck)    ✓ (strategic)
e. [ɛ][kstrə]        *           *

In (6a), there is a violation of the LOF, because no word ends in [ɛkstr].
(6a) satisfies the LOI, though, because there are words that start with [ə], such
as about. In (6e), there is both a violation of the LOF, because no word ends
in [ɛ], and a violation of the LOI, because no word starts with [kstrə]. In the
other three cases, both the LOI and the LOF are satisfied. Next, we consider
the LOI and the LOF in the syllabification of whiskey, shown in (7).

(7) LOI and LOF in the syllabification of whiskey (in American English)

   Syllabification   LOF        LOI
a. [wɪ][ski]         *          ✓ (scheme)
b. [wɪsk][i]         ✓ (risk)   ✓ (east)
c. [wɪ[s]ki]         ✓ (miss)   ✓ (keen)
d. [wɪs][ki]         ✓ (miss)   ✓ (keen)

In (7a), the LOF is violated, because no word in American English ends
in [ɪ]. In the other three cases, both the LOI and the LOF are satisfied. The
example shows that an unqualified Max Onset may violate the LOF, whereas
a qualified Max Onset satisfies both the LOI and the LOF.
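
The checks in (6) and (7) can be reproduced with a toy word list. The sketch below is only an illustration: the lexicon holds rough, hypothetical transcriptions (length marks omitted) of the sample words cited in the two tables, and, like the tables, it tests the LOF on the first syllable and the LOI on the second. The ambisyllabic analysis (7c) is left out because it does not divide the word into two separate syllables.

```python
VOWELS = {"ɛ", "ə", "ɪ", "i", "au"}

# Toy lexicon (an assumption): space-separated segments, length marks omitted.
LEXICON = [w.split() for w in (
    "ə b au t",            # about
    "t ɛ k s t",           # text
    "r ə p i t",           # repeat
    "ɪ n d ɛ k s",         # index
    "t r ə d ɪ ʃ ə n",     # tradition
    "d ɛ k",               # deck
    "s t r ə t i dʒ ɪ k",  # strategic
    "s k i m",             # scheme
    "r ɪ s k",             # risk
    "i s t",               # east
    "m ɪ s",               # miss
    "k i n",               # keen
)]

def nucleus_index(syllable):
    return next(i for i, seg in enumerate(syllable) if seg in VOWELS)

def satisfies_loi(syllable):
    """LOI: the body (onset plus vowel) begins some word."""
    body = syllable[:nucleus_index(syllable) + 1]
    return any(word[:len(body)] == body for word in LEXICON)

def satisfies_lof(syllable):
    """LOF: the rime (vowel plus coda) ends some word."""
    rime = syllable[nucleus_index(syllable):]
    return any(word[-len(rime):] == rime for word in LEXICON)

def check(first, second):
    """Return (LOF of the first syllable, LOI of the second syllable)."""
    return satisfies_lof(first.split()), satisfies_loi(second.split())

CANDIDATES = [
    ("ɛ k s t r", "ə"), ("ɛ k s t", "r ə"), ("ɛ k s", "t r ə"),
    ("ɛ k", "s t r ə"), ("ɛ", "k s t r ə"),            # extra, (6a)-(6e)
    ("w ɪ", "s k i"), ("w ɪ s k", "i"), ("w ɪ s", "k i"),  # whiskey
]
for first, second in CANDIDATES:
    print(f"[{first}][{second}]", check(first, second))
```

With a realistic lexicon the same two functions would apply to every syllable of a word; the toy list only supports the boundary checks shown in the tables.
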
Next, let us evaluate various approaches by the LOI and the LOF. Since the
ambisyllabic analysis of Kahn (1976) complicates syllable structure, without
obvious advantages over the analysis of Pulgram (1970), we do not consider
it further. Instead, we consider Hoard (1971), Lowenstamm (1981), Halle and
Vergnaud (1987), and Pulgram (1970). Their analyses of Debra and essay are
shown in (8) and (9).

(8) LOI and LOF and the analysis of Debra /​dɛbrə/​

Analysis    Proponent          Requirements                   LOF   LOI
[dɛb][rə]   Hoard              Max Stressed Onset, Max Coda   ✓     ✓
[dɛ][brə]   Lowenstamm         Max Onset, Sonority            *     ✓
[dɛ][brə]   Halle & Vergnaud   Max Onset                      *     ✓
[dɛb][rə]   Pulgram            Max Onset, possible rime       ✓     ✓

(9) LOI and LOF and the analysis of essay /​ɛsei/​

Analysis    Proponent          Requirements                   LOF   LOI
[ɛ][sei]    Hoard              Max Stressed Onset, Max Coda   *     ✓
[ɛ][sei]    Lowenstamm         Max Onset, Sonority            *     ✓
[ɛ][sei]    Halle & Vergnaud   Max Onset                      *     ✓
[ɛs][ei]    Pulgram            Max Onset, possible rime       ✓     ✓

In Debra, the second syllable has no stress. For Hoard (1971), the coda of
the first syllable should be maximized, yielding [dɛb][rə], which satisfies both
the LOF and the LOI. For Lowenstamm (1981), [br] is a good onset, because
it has increasing sonority, yielding [dɛ][brə], where [dɛ] violates the LOF,
because no word ends in [ɛ]. Similarly, the analysis of Halle and Vergnaud
(1987) violates the LOF. Finally, the analysis of Pulgram (1970) satisfies both
the LOI and the LOF.
In essay, the second syllable has secondary stress. For Hoard (1971),
its onset should be maximized, yielding [ɛ][sei], where [ɛ] violates the LOF.
Similarly, the analyses of Lowenstamm (1981) and Halle and Vergnaud (1987)
violate the LOF. For Pulgram (1970), ‘possible rime’ requires the first syllable
to be [ɛs], yielding [ɛs][ei], which satisfies both the LOI and the LOF.
In summary, while all analyses assume some version of Max Onset, only
Pulgram’s version observes the LOF. Let us redefine the two versions in (10)
and call them Max Onset and Revised Max Onset.

(10) Two versions of maximizing the onset:

Max Onset: Maximize the onset, under the LOI but not the LOF.
Revised Max Onset: Maximize the onset, under both the LOI and the LOF.
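
The difference between the two versions in (10) can be sketched as a small procedure over a single intervocalic cluster. Everything named here is an assumption made for illustration: the toy lexicon (rough transcriptions, length marks omitted), the helper names, and the simplification that only the boundary between two syllables is examined.

```python
# Toy lexicon (an assumption): space-separated segments.
LEXICON = [w.split() for w in (
    "s k i m", "k i n", "m ɪ s",          # scheme, keen, miss
    "b r ə z ɪ l", "r ə p i t", "w ɛ b",  # Brazil, repeat, web
    "s ei", "ei dʒ", "j ɛ s",             # say, age, yes
    "s t r ə t i dʒ ɪ k", "d ɛ k",        # strategic, deck
)]

def word_initial(body):
    """LOI test: does some word begin with this body?"""
    return any(word[:len(body)] == body for word in LEXICON)

def word_final(rime):
    """LOF test: does some word end with this rime?"""
    return any(word[-len(rime):] == rime for word in LEXICON)

def syllabify(v1, cluster, v2, revised):
    """Divide V1 + cluster + V2 into (coda, onset), longest onset first.

    revised=False: Max Onset (LOI only).
    revised=True:  Revised Max Onset (LOI and LOF).
    """
    for i in range(len(cluster) + 1):
        coda, onset = cluster[:i], cluster[i:]
        if not word_initial(onset + [v2]):
            continue  # body of the second syllable violates the LOI
        if revised and not word_final([v1] + coda):
            continue  # rime of the first syllable violates the LOF
        return coda, onset
    return None

WORDS = {
    "whiskey": ("ɪ", ["s", "k"], "i"),
    "Debra": ("ɛ", ["b", "r"], "ə"),
    "essay": ("ɛ", ["s"], "ei"),
    "extra": ("ɛ", ["k", "s", "t", "r"], "ə"),
}
for name, (v1, cluster, v2) in WORDS.items():
    print(name, syllabify(v1, cluster, v2, False), syllabify(v1, cluster, v2, True))
```

For whiskey, Debra, and essay the two versions disagree, because the short vowel cannot end a word and must keep a coda consonant; for extra they agree.
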

Given the new definitions, Hoard (1971) assumes Max Onset for stressed
syllables and Max Coda otherwise. Lowenstamm (1981) assumes Max Onset,
with an additional requirement for a consonant sequence to have increasing
sonority in the onset. Halle and Vergnaud (1987) assume Max Onset. Finally,
Pulgram (1970) assumes Revised Max Onset, which also ensures that all rimes
are possible.
Let us now consider syllable weight, which is based on the length of the
rime. A syllable is light if the rime consists of a short vowel without a coda;
otherwise, the syllable is heavy. In English, a long vowel is one that can end
a stressed syllable. In American English, long vowels include [iː uː ei ou ai au
oi ɑː ɒː ɝː], as in see, two, day, go, buy, how, boy, spa, law, and fur respectively.
A short vowel is one that cannot end a stressed syllable, such as [ɪ ʊ ɛ ʌ], as in
sit, book, bed, and bud, or one that is unstressed only, such as [ə ɚ]. The vowel
[æ] is usually thought to be short as well (Chomsky and Halle 1968), although
it is phonetically long and does end a stressed syllable in some marginal words,
such as nah [næː]. Finally, unstressed word-final [i u] are sometimes treated as
short (Halle
and Vergnaud 1987). In (11) we summarize vowel length in American English.

(11) Vowel length in American English

Long            [iː uː ei ou ai au oi ɑː ɒː ɝː]
Short           [ɪ ʊ ɛ ʌ], [ə ɚ], ([æ])
Special cases   unstressed word-final [i u] are short

Given the definition of syllable weight and vowel length, it is clear that
different ways of syllabification lead to different weight patterns. Consider
the word whiskey, whose syllabification and weight patterns are shown in (12).
For visual clarity, a hyphen is added between syllables in the columns under
Rime and Weight. In addition, H and L are shorthand notations for heavy
and light syllables respectively.

(12) Syllabification and syllable weight for whiskey /​wɪski/​

Syllabification   Rime        Weight        Shorthand
[wɪ][ski]         [ɪ]-[i]     light-light   LL
[wɪsk][i]         [ɪsk]-[i]   heavy-light   HL
[wɪs][ki]         [ɪs]-[i]    heavy-light   HL

In (12), [ɪsk] and [ɪs] are both called heavy, although [ɪsk] has an extra con-
sonant. To distinguish them, VCC (such as [ɪsk]) and VVC (such as [aun] in
council) are sometimes called ‘super-​heavy’, in contrast to VC and VV, which
are regular heavy. However, the distinction is of little consequence for our dis-
cussion and is not made here.
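
The weight calculation in (12) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vowel sets are copied from (11), [æ] is treated as short, and the names `weight`, `weight_pattern`, and `whiskey` are hypothetical helpers, not part of any cited analysis.

```python
# Vowel inventories follow (11). Treating [æ] and word-final [i u] as short
# is the traditional assumption mentioned in the text, not a settled fact.
LONG_VOWELS = {"iː", "uː", "ei", "ou", "ai", "au", "oi", "ɑː", "ɒː", "ɝː"}
SHORT_VOWELS = {"ɪ", "ʊ", "ɛ", "ʌ", "ə", "ɚ", "æ", "i", "u"}


def weight(vowel: str, coda: str) -> str:
    """Return 'H' (heavy) or 'L' (light) for a rime given as (vowel, coda)."""
    if vowel in LONG_VOWELS or coda:
        return "H"  # a long vowel or any coda makes the rime heavy
    if vowel in SHORT_VOWELS:
        return "L"  # a short vowel without a coda is light
    raise ValueError(f"unknown vowel: {vowel!r}")


def weight_pattern(rimes) -> str:
    """Concatenate the weights of a sequence of (vowel, coda) rimes."""
    return "".join(weight(v, c) for v, c in rimes)


# The three syllabifications of whiskey from (12), reduced to their rimes.
whiskey = {
    "[wɪ][ski]": [("ɪ", ""), ("i", "")],
    "[wɪsk][i]": [("ɪ", "sk"), ("i", "")],
    "[wɪs][ki]": [("ɪ", "s"), ("i", "")],
}
patterns = {syll: weight_pattern(rimes) for syll, rimes in whiskey.items()}
print(patterns)
```

The sketch does not separate regular heavy from super-heavy rimes, in line with the decision above not to make that distinction.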

2.2 Proposals of word stress in English

Word stress in English is sensitive to syllable weight, in the sense that heavy
syllables tend to attract stress (Liberman and Prince 1977; Halle and Vergnaud
1987; Prince 1992; Hayes 1995). Let us consider two approaches to word stress
assignment, which we can call deterministic and non-​deterministic.

2.2.1 Deterministic assignment of word stress

In the deterministic approach, there is a specific set of requirements or rules
for word stress assignment, and each given sequence of phonemes has just one
solution. Some words satisfy all the requirements, yield the expected solution,
and are considered to have regular stress patterns. Other words fail to satisfy
one or more of the requirements, do not yield the expected solution, and are
considered to have exceptional stress patterns.
The deterministic approach is proposed by Halle and Vergnaud (1987)
and Hayes (1995). For illustration, let us consider the analysis of main stress
in English nouns. According to Halle and Vergnaud (1987: 227), the stress
pattern of English nouns is as in (13).

(13) Main stress in English nouns (Halle and Vergnaud 1987: 227):
Main stress is on the penultimate syllable if it is heavy (e.g., agenda).
Else main stress is on the antepenultimate syllable (e.g., Canada, Mexico)

To obtain the proposed stress pattern, Halle and Vergnaud (1987) propose
an ordered set of rules, which we rephrase in (14), where H is a heavy syl-
lable, L is a light syllable, and parentheses over H or L indicate foot bound-
aries. A general assumption in metrical phonology is that every foot has stress
(either primary or secondary) and every stress implies a foot. In a trochaic
foot with two syllables, stress falls on the one on the left.

(14) Ordered rules for assigning main stress in English nouns (Halle and
Vergnaud 1987)
a. Syllabify according to Max Onset.
b. Exclude the final syllable (if the word has two or more syllables).
c. Build a trochaic foot from the right, which can be (H), (HL), or (LL).
d. Else build (L) instead.

In (15) we show the analysis of some English nouns, both regular ones and
exceptional ones, where * indicates a violation of a rule in (14). Halle and
Vergnaud (1987) consider word-final [i] to be short in some words, such as
city, which need not concern us.

(15) Analysis of some English nouns according to (14)

Word       (14a)            Weight  (14b)   (14c)     (14d)     Comment
agenda     [ə][ɡɛn][də]     LHL     LH<L>   L(H)<L>             regular
marina     [mə][ri:][nə]    LHL     LH<L>   L(H)<L>             regular
Canada     [kæ][nə][də]     LLL     LL<L>   (LL)<L>             regular
lemon      [lɛ][mən]        LH      L<H>              (L)<H>    regular
city       [sɪ][ti]         LL      L<L>              (L)<L>    regular
Mexico     [mɛk][sə][ko:]   HLH     HL<H>   (HL)<H>             regular
Tennessee  [tɛ][nə][si:]    LLH     *       LL(H)               (14b) violated
Japan      [ʤə][pæn]        LH      *       L(H)                (14b) violated
banana     [bə][næ][nə]     LLL     LL<L>   *         L(L)<L>   (14c) violated
textile    [tɛk][stail]     HH      *       *         (H)(H)

The first six words are regular and the last four exceptional. In Tennessee
and Japan, (14b) fails to exclude the final syllable, which acquires main stress.
In banana, (14c) fails to build (LL); as a result, (14d) builds (L) instead. In tex-
tile, both syllables have stress, where the first has main stress and the second
has secondary stress. This means that (14b) fails to exclude the final syllable
(because excluded syllables cannot be assigned stress). In addition, (14c) fails
to assign main stress to the final syllable; instead, main stress appears on the
preceding syllable. It is worth noting, too, that although lemon and city are
thought to be regular words, their foot (L) is in fact exceptional, because it is
not among the preferred feet in the first step of foot construction (14c). We
shall return to this point.
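
As a rough illustration of how the rules in (14) pick exactly one footing per weight pattern, the following Python sketch applies (14b) to (14d) to a weight string. It is a simplified reading, not Halle and Vergnaud's own formalism: the function name `foot_hv` and the `attested` dictionary (filled in from the footings shown in (15)) are assumptions made for this example.

```python
def foot_hv(weights: str) -> str:
    """Apply (14b)-(14d) to a weight string such as 'LHL'."""
    if len(weights) >= 2:
        # (14b): exclude the final syllable, shown in angle brackets
        body, final = weights[:-1], "<" + weights[-1] + ">"
    else:
        body, final = weights, ""
    # (14c): build one trochaic foot at the right edge: (HL), (LL), or (H)
    for size, shapes in ((2, ("HL", "LL")), (1, ("H",))):
        if body[-size:] in shapes:
            return body[:-size] + "(" + body[-size:] + ")" + final
    # (14d): otherwise build (L)
    return body[:-1] + "(" + body[-1] + ")" + final


# Weight patterns and attested footings, copied from (15).
attested = {
    "agenda": ("LHL", "L(H)<L>"),
    "Canada": ("LLL", "(LL)<L>"),
    "lemon": ("LH", "(L)<H>"),
    "Mexico": ("HLH", "(HL)<H>"),
    "banana": ("LLL", "L(L)<L>"),
    "Japan": ("LH", "L(H)"),
}
# A word is exceptional when the rules predict a footing it does not have.
exceptional = [w for w, (wt, foot) in attested.items() if foot_hv(wt) != foot]
print(exceptional)
```

Because the function returns a single footing for each weight string, any word whose attested footing differs has to be listed separately, which is the sense in which the approach is deterministic.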
Hayes (1995) offers a similar analysis, except that he only assumes two
regular foot types, (H) and (LL), each having two moras. His analysis is
rephrased in (16) and illustrated in (17).

(16) Rules for assigning main stress in English nouns (Hayes 1995)
a. Syllabify according to Max Onset.
b. Exclude the final syllable (if the word has two or more syllables).
c. Build a moraic trochee from the right, which can be (H), or (LL).
d. Else build (L) instead.

(17) Analysis of some English nouns according to (16)

Word       (16a)            Weight  (16b)   (16c)     (16d)     Comment
agenda     [ə][ɡɛn][də]     LHL     LH<L>   L(H)<L>             regular
marina     [mə][ri:][nə]    LHL     LH<L>   L(H)<L>             regular
Canada     [kæ][nə][də]     LLL     LL<L>   (LL)<L>             regular
lemon      [lɛ][mən]        LH      L<H>              (L)<H>    regular
city       [sɪ][ti]         LL      L<L>              (L)<L>    regular
Mexico     [mɛk][sə][ko:]   HLH     HL<H>   (H)L<H>             regular
Tennessee  [tɛ][nə][si:]    LLH     *       LL(H)               (16b) violated
Japan      [ʤə][pæn]        LH      *       L(H)                (16b) violated
banana     [bə][næ][nə]     LLL     LL<L>   *         L(L)<L>   (16c) violated
textile    [tɛk][stail]     HH      *       *         (H)(H)

It can be seen that the exceptional words for Hayes (1995) are exactly the
same as those for Halle and Vergnaud (1987). English word stress can also be
analyzed in the framework of Optimality Theory (e.g., Pater 2000), again with
the same set of exceptional words.
To deal with exceptional words, the deterministic approach has to mark
them in some way, so that they do not undergo the same rules or requirements
as regular words. For example, Halle and Vergnaud (1987) and Hammond
(1999) propose that some English words have a lexical mark on a given syl-
lable, which means it must be stressed. Similarly, Pater (2000) proposes that
English words are divided into different classes, so that they are subject to
different constraints. Such proposals essentially acknowledge that English
word stress is not completely predictable.

2.2.2 Non-​deterministic assignment of word stress

In the non-​deterministic approach, there is also a specific set of requirements
or rules for word stress assignment, but a given sequence of phonemes can
satisfy the requirements in more than one way. As a result, all words are good
and no word is exceptional. I discuss two proposals of the non-​deterministic
approach, Burzio (1994) and Duanmu (2007).

Burzio’s analysis

The proposal of Burzio (1994) is summarized in (18), where σ represents
either H or L. Thus, the foot type (Hσ) can be (HL) or (HH), and (σLσ) can
be (HLH), (HLL), (LLH), or (LLL).

(18) Constraints for word stress in English (Burzio 1994)

a. Max Onset.
b. Main stress falls on the first foot from right.
c. The only good feet are (Hσ) and (σLσ), both being trochaic.
d. A word can end in a ‘null vowel’.
e. A final L can be left outside of a foot.
The analysis applies to not just nouns but all English words. For illustra-
tion, some examples are shown in (19), where we use Ø to represent a null
vowel. A syllable with a null vowel is treated as L. Following Chomsky and
Halle (1968), Burzio considers an unstressed final [i] to be a short vowel.

(19) Analysis of some English nouns according to (18)

Word     Syllabification  Foot    Foot type  Comment
agenda   [ə][ɡɛn][də]     L(HL)   (Hσ)
marina   [mə][ri:][nə]    L(HL)   (Hσ)
Canada   [kæ][nə][də]     (LLL)   (σLσ)      Max Onset
lemon    [lɛ][mə][nØ]     (LLL)   (σLσ)      Max Onset, null vowel
Mexico   [mɛk][sə][ko:]   (HLH)   (σLσ)
Japan    [ʤə][pæn][nØ]    L(HL)   (Hσ)       Null vowel, geminate [nn]
pan      [pæn][nØ]        (HL)    (Hσ)       Null vowel, geminate [nn]
banana   [bə][næn][nə]    L(HL)   (Hσ)       Geminate [nn]
sardine  [sar][di:][nØ]   H(HL)   (Hσ)       Null vowel
alpine   [æl][pai][nØ]    (HH)L   (Hσ)       Null vowel, skipped L
city     [sɪt][ti]        (HL)    (Hσ)       Geminate [tt]

The analysis of Canada [kæ][nə][də] (LLL) and lemon [lɛ][mə][nØ] (LLL)
shows that Burzio assumes a version of Max Onset that ignores the Law of Finals. If so,
banana ought to yield an ill-​formed result [bə][næ][nə] L(LL), where (LL) is
not in his inventory of good feet. For banana to yield L(HL), Burzio makes
the claim that banana has a geminate consonant [nn], so as to yield a well-​
formed foot (HL). Similarly, words like city have a geminate consonant in
order to yield (HL) and avoid (LL). Finally, Burzio assumes that every word
ends in a vowel; those that end in a consonant have a final ‘null vowel’. This
way, words that have final stress, such as pan and Japan, also have a good foot
(HL) rather than a bad foot (H). It is worth noting that Max Onset applies to
the null vowel, too, so that pan must have a geminate [nn]; otherwise, it would
become [pæ][nØ], yielding an ill-​formed foot (LL).
The point of interest here is that Burzio’s constraints can be satisfied in
more than one way. For example, the weight pattern HHL can yield the foot
structure (HH)L, as in alpine, or H(HL), as in sardine.
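
The non-determinism can be made concrete with a small enumeration. The Python sketch below lists every placement of the rightmost (main-stress) foot that (18c) and (18e) allow for a given weight string. It is a simplified reading of (18): null-vowel syllables are assumed to be already included in the string as L, footing to the left of the main foot is ignored, and the function names are hypothetical.

```python
def is_good_foot(foot: str) -> bool:
    """(18c): the only good feet are (Hσ) and (σLσ)."""
    if len(foot) == 2:
        return foot[0] == "H"   # (Hσ): first syllable heavy
    if len(foot) == 3:
        return foot[1] == "L"   # (σLσ): middle syllable light
    return False


def main_foot_parses(weights: str) -> list:
    """List the placements of the rightmost foot allowed by (18c) and (18e)."""
    parses = []
    # (18e): the foot is word-final, or it is followed by one unfooted L
    tails = [""] + (["L"] if weights.endswith("L") else [])
    for tail in tails:
        core = weights[: len(weights) - len(tail)]
        for size in (2, 3):
            if len(core) >= size and is_good_foot(core[-size:]):
                parses.append(core[:-size] + "(" + core[-size:] + ")" + tail)
    return parses


print(main_foot_parses("HHL"))  # the pattern shared by alpine and sardine
print(main_foot_parses("LLL"))  # Canada under Max Onset
```

For HHL the function returns two parses, H(HL) and (HH)L, which are the sardine and alpine structures in (19); for LLL it returns only (LLL).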
Like the deterministic approach, the non-​deterministic approach assumes
that English word stress is not fully predictable, because each word may choose
its own way to satisfy the set of requirements. However, unlike the determin-
istic approach, which treats some words as regular and some as exceptional,
Burzio treats all English words as equally well formed, at least with regard to
syllable structure and foot structure, although as Burzio acknowledges, evi-
dence for geminate consonants is rather weak.
Burzio’s analysis has several shortcomings, though. First, it is unclear what
the relation is among the good feet. Burzio suggests that they have similar weight
values, and he proposes a rather idiosyncratic way of calculating the total weight
of a foot. But if we assume the traditional view that H has two moras and L
has one, Burzio’s feet range from three moras in (HL) to five in (HLH), which
is quite a range. In addition, why is (LHL) a bad foot, while (HLL) and (LLH)
are good ones, even though they all have four moras each? Similarly, why is (LH)
a bad foot, while (HL) is a good one, even though they both have three moras
each? The second shortcoming is that trisyllabic feet are fairly rare and metrical
theory would be simpler without assuming them. Third, most people consider
the second syllable of alpine to have secondary stress (e.g., Chomsky and Halle
1968; Halle and Vergnaud 1987), yet Burzio considers it to have no stress. The
same problem can be raised for verify (LLH) and notify (HLH), where the final
H is often thought to have secondary stress. Finally, Burzio assumes inconsistent
syllabification for the syllable with main stress. For example, the first syllable in
city is heavy, whereas that in Canada is light, even though (HLL) is an allowable
foot in his analysis. Obviously, the problem arises from (i) the assumption of
Max Onset and (ii) the desire to disallow (LL) and (LH). Max Onset yields LLL
for Canada, but would also yield unwanted (LL) for city and banana. To avoid
(LL), Burzio proposes that some words have an abstract ‘geminate’ consonant,
such as [nn] in banana [bənænnə] and [tt] in city [sɪtti], even though he
acknowledges that the proposal is circular.

The present analysis

I would like to offer a better version of the non-​deterministic approach, without
the problems in Burzio’s analysis. First, I propose that English has both moraic
trochee (Hayes 1995) and syllabic trochee (Halle and Vergnaud 1987), similar
to Chinese (Duanmu 2007). This proposal differs from a common view that a
language can only choose one foot type (at its lowest level of metrical structure).
However, there is good evidence that a language can have both. For example,
as discussed in Duanmu (2007), there is a contrast in Chinese between heavy
syllables, which can carry stress and tone, and light syllables, which cannot
carry stress or tone. This calls for counting moras, so that each heavy syllable
is a moraic foot. In addition, Chinese has a strong requirement for a minimal
word to be disyllabic and a strong preference for certain word-​length combin-
ations over others, which calls for a disyllabic trochee as well. English is similar
to Chinese in the sense that stress is sensitive to syllable weight, which means
that English must count moras (Hayes 1995). In addition, in many English
words main stress is on the third syllable from the right, and a syllabic trochee
is a simple way to account for it (Halle and Vergnaud 1987).
According to Duanmu (2007), there are only three well-formed foot
structures, shown in (20), where x represents stress, a dot represents a syl-
lable boundary, and 0 represents an unstressed syllable. Among the three foot
structures, (mm) is a heavy syllable, which is always stressed. In (HL), only the
first syllable has stress. In (HH), both syllables have stress, but the first has more.

(20) Three well-​formed foot structures (Duanmu 2007)

Name              Shorthand   Structure
Moraic trochee    (mm)        x
Syllabic trochee  (HL)        x
                              (x    0)
Syllabic trochee  (HH)        x
                              (x    x)

It is worth noting that there is no stressed L. This means that, unlike Halle
and Vergnaud (1987) and Hayes (1995), for whom (L.L) is a possible foot,
in the present analysis it is not. The present analysis agrees with two facts.
First, in Chinese, where syllable boundaries are clear, no L can carry stress
or tone. Second, in English no stressed final syllable is L, even though both
Halle and Vergnaud (1987) and Hayes (1995) allow L to be an exceptional
foot. Moreover, as we have seen above, while syllable boundaries are not
always obvious in English, Revised Max Onset can ensure that all stressed
syllables are H.
It is also worth noting that, in (HH), there is no stress clash, because at
the moraic level, the two stresses are separated by an unstressed mora. In
addition, by treating (HH) as a regular foot, we avoid a problem in previous
analyses. Specifically, in Halle and Vergnaud (1987), for words like alpine and
moron, main stress is assigned to the second syllable, and then a special rule
is used to shift the stress to the left. Similarly, Burzio (1994) has to make the
unusual claim that the second syllable in words like alpine and moron has no
secondary stress, contrary to many other people’s judgment. In the present
analysis, such words need no special treatment.
The proposed foot structures can be derived from two well-known
constraints, Foot Binarity and the Weight-​Stress Principle, shown in (21),
along with Revised Max Onset, Parse2, Main Stress, and Null Beat, to
account for syllabification and word stress in English.

(21) Constraints on syllabification, foot structure, and word stress

a. Foot Binarity (FtBin): Every foot must have two beats.
b. Weight-​Stress Principle (WSP): H has stress; L has no stress.
c. Revised Max Onset (RMO)
d. Parse2: Two free beats must form a foot.
e. Main Stress: Main stress must be that of a syllabic foot.
f. Null Beat: A null beat counts as L and is realized as a pause or pre-​
pause lengthening.

Foot Binarity requires a moraic foot to contain two moras and a syllabic
foot to contain two syllables (Prince 1980). The WSP has two parts. The first
part is similar to what Prince (1992) calls the Weight-​to-​Stress Principle, which
requires H to be stressed. The second part is similar to what Prince (1992)
calls the Stress-​to-​Weight Principle, which excludes (m.m) or (LL) from being
a possible foot, because there is a stressed L. Prince (1992) rejects the second
part of the WSP, in part because many English words, such as sanity, banana,
and city, seem to have a stressed L. However, as I have shown, the problem
arises from Max Onset. If we assume Revised Max Onset instead, then both
parts of the WSP can be maintained.
Parse2 requires every heavy syllable to form a moraic foot and have
stress, because it contains two moras (two moraic beats). In addition, Parse2
disallows two adjacent free syllables (two syllabic beats). On the other hand,
Parse2 allows one L to be left alone (without a foot). Now it can be seen
that there is an overlap between the WSP and Parse2, both requiring H to be
stressed. A possible solution is to replace the WSP with a requirement that
a moraic foot cannot contain a syllable boundary, or *(m.m). Interestingly,
although Hayes (1995) allows (m.m) as a possible foot, he needs a constraint
to prevent a syllable from being split by a foot boundary, that is, m(m.m) for
HL, (m.m)m for LH, and (m.m)(m.m) for LHL. It can be seen that *(m.m) is
sufficient to rule out such cases.
Main Stress agrees with the fact that stress in words like France or Berlin is
as strong as main stress in nation, Chicago, or compensation; this is achieved
by the representation that in all these words main stress falls on a syllabic foot.
Finally, Null Beat claims that the null beat is physically real and verifiable;
this accounts for the well-​known fact that a stressed pre-​pause English syl-
lable is much longer than a stressed non-​final one (Price et al. 1991).
With the above constraints, let us consider the analysis of some English words,
including their syllables and foot structures, shown in (22), where 1 indicates pri-
mary stress, 2 indicates secondary stress, and 0 indicates lack of stress.

(22) Analysis of some English words according to (21)

Word     Syllables        Foot       Stress  Comment
agenda   [ə][ɡɛn][də]     L(HL)      0–1–0
marina   [mə][ri:][nə]    L(HL)      0–1–0
Canada   [kæn][ə][də]     (HL)L      1–0–0
lemon    [lɛm][n]         (HL)       1–0     Syllabic [n]
Mexico   [mɛk][sə][ko:]   (HL)(mm)   1–0–2
Japan    [ʤə][pæn]Ø       L(HL)      0–1–0   Null beat
pan      [pæn]Ø           (HL)       1–0     Null beat
banana   [bə][næn][ə]     L(HL)      0–1–0
sardine  [sar][di:n]Ø     (mm)(HL)   2–1–0   Null beat
alpine   [æl][pain]       (HH)       1–2
city     [sɪt][i]         (HL)       1–0

The analysis shows that the same CV string, such as CVCVCV in Canada
and banana, can satisfy the constraints in more than one way and yield more
than one good solution. It can be shown, too, that every English word has at
least one way to satisfy all the constraints.
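
The Stress column of (22) follows mechanically from the Foot column, which can be shown with a short Python sketch. It rests on assumptions that go slightly beyond the text: each word is taken to contain at most one syllabic foot, as in (22); a null beat is written as L; and ASCII hyphens replace the dashes of the table. The name `stress_pattern` is hypothetical.

```python
import re


def stress_pattern(feet: str) -> str:
    """Derive stress levels (1 primary, 2 secondary, 0 none) from a footing."""
    levels = []
    # Tokens: a moraic foot (mm), a two-syllable foot such as (HL), or a free syllable.
    for token in re.findall(r"\(mm\)|\([HL]{2}\)|[HL]", feet):
        if token == "(mm)":
            levels.append("2")  # a heavy syllable outside the syllabic foot
        elif token.startswith("("):
            levels.append("1")  # head of the syllabic foot carries main stress
            levels.append("2" if token[2] == "H" else "0")  # (HH) versus (HL)
        else:
            levels.append("0")  # free syllable, no stress
    return "-".join(levels)


for feet in ["L(HL)", "(HL)L", "(HL)(mm)", "(mm)(HL)", "(HH)"]:
    print(feet, stress_pattern(feet))
```

Words with more than one syllabic foot would need an additional statement about which foot carries main stress, which this sketch does not attempt.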

2.3 A set of criteria

Let us now evaluate various approaches to syllabification and stress
assignment, using a common set of criteria. It is reasonable to say the criteria
in (23) are desired for all approaches.

(23) A common set of criteria to satisfy

a. LOI (the Law of Initials)
b. LOF (the Law of Finals)
c. WSP (the Weight-​Stress Principle)
d. FtBin (Foot Binarity)
e. No Marking: Avoid marked words (exceptional words).

The LOI, the LOF, the WSP, and FtBin have been discussed above. No
Marking aims to minimize exceptional or marked words. The evaluation of
various approaches to syllabification and stress assignment is shown in (24),
where HV refers to Halle and Vergnaud (1987).

(24) Evaluation of approaches to syllabification and stress assignment

Approach               LOI  LOF  WSP  FtBin  No Marking
Max Onset              ✓    *    *
RMO                    ✓    ✓    ✓
HV (MO); Hayes (MO)    ✓    *    *    *      *
Burzio (MO)            ✓    *    *    ✓      *
Present (RMO)          ✓    ✓    ✓    ✓      ✓

As discussed above, Max Onset ignores the LOF, because it creates stressed
light syllables, such as the first syllable in Canada [kæ][nə][də] and very [vɛ][ri],
which are not found in word-​final positions. In addition, such stressed light
syllables violate the WSP. In contrast, RMO always satisfies the LOI, the LOF,
and the WSP. There are two reasons. First, word-​initial vowels are common,
which means that syllables without an onset can still satisfy the LOI. Second,
stressed word-final syllables are always heavy and satisfy the WSP, and
consequently, the LOF requires stressed non-final syllables to be syllabified in the
same way, which means they always satisfy the WSP, too.
Next we consider stress assignment and foot structure. First, in the deter-
ministic approach, both Halle and Vergnaud (1987) and Hayes (1995) assume
Max Onset, which violates the LOF and the WSP, as just discussed. In addition,
because they assume the exclusion of the final syllable, words like very and city
will end up with just one short syllable, which is made into a foot by itself, which
violates FtBin, regardless of whether we assume moraic feet (Hayes 1995) or
syllabic feet (Halle and Vergnaud 1987). Finally, in the deterministic approach,
some words are regular and some exceptional, which violates No Marking.
Although Burzio (1994) assumes a non-​deterministic approach, he assumes
Max Onset, too. Therefore, his analysis violates the LOF. In addition, to make
sure that words like city, very, and disco have a stressed heavy syllable, as
required by the foot (Hσ), these words have to be marked with an underlying
geminate consonant, which violates No Marking.

The present analysis assumes RMO, which always satisfies the LOI, the
LOF, and the WSP. In addition, given the null beat that is available in pre-​
pause position, a fact that is independently motivated, FtBin is always satis-
fied, and so is No Marking. Moreover, the inclusion of (HH) as a good foot
avoids the need to treat words like disco and alpine as exceptional ones that
need special marking or undergo different requirements or rules.

2.4 Why does Max Onset ignore the LOF?

Given the obvious advantages of RMO, as just seen, one would wonder why
there are analyses that choose Max Onset instead. The main reason, it seems
to me, is the traditional assumption in generative grammar that phonology
consists of an ordered set of rules. Specifically, there is an assumption that syl-
labification precedes stress assignment and vowel reduction. For illustration,
consider the analysis of Canada. According to Chomsky and Halle (1968),
English has a rule, given in (25), which reduces unstressed short ([-​tense])
vowels to [ə].

(25) Vowel Reduction in English (Chomsky and Halle 1968: 111)

[-stress, -tense, V] → [ə]

In addition, according to Chomsky and Halle (1968), the underlying form
of Canada is [kænædə]. The first [æ] shows up in Canada. The second [æ]
shows up as [ei] in Canadian, after other rules that need not concern us. Now
let us consider how [kænædə] can be syllabified, before stress is assigned.
Some options are shown in (26).

(26) Possible syllabifications of Canada [kænædə]

Method           Syllables       LOF   Stress
Max Onset        [kæ][næ][də]    **
Max Coda         [kæn][æd][ə]          *
Max First Coda   [kæn][æ][də]    *

If we syllabify according to Max Onset, the LOF is violated by the first two
syllables. If we syllabify according to Max Coda, the LOF is satisfied, but the
second syllable causes a problem for stress assignment: It is H, yet it does not
attract stress. If we maximize the coda of the first syllable only (and maximize
the onset of other syllables), the second syllable still violates the LOF. In sum-
mary, given Chomsky and Halle’s analysis of underlying forms, if syllabifica-
tion precedes stress assignment, there is no way to satisfy the LOF, without
causing problems for stress assignment.
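
The LOF column of (26) can be reproduced with a very small check. The Python sketch below counts syllables that end in a short vowel of the kind that, according to the discussion above, cannot end a word. This is a deliberate simplification of the LOF: it looks only at the final segment, it treats [æ] as short, and the names `NON_FINAL_VOWELS` and `lof_violations` are hypothetical.

```python
# Short vowels that, per the text, cannot end a stressed syllable or a word.
NON_FINAL_VOWELS = ("ɪ", "ʊ", "ɛ", "ʌ", "æ")


def lof_violations(syllables) -> int:
    """Count syllables whose last segment is a vowel that cannot end a word."""
    return sum(1 for s in syllables if s.endswith(NON_FINAL_VOWELS))


# The three syllabifications of underlying [kænædə] from (26).
candidates = {
    "Max Onset": ["kæ", "næ", "də"],
    "Max Coda": ["kæn", "æd", "ə"],
    "Max First Coda": ["kæn", "æ", "də"],
}
counts = {name: lof_violations(sylls) for name, sylls in candidates.items()}
print(counts)
```

The counts match the asterisks under LOF in (26). The stress problem of Max Coda, an unstressed heavy second syllable, is a separate matter that this check does not cover.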
A solution is available if we give up the assumption that syllabification
precedes stress assignment, and assume instead that they can be evaluated
simultaneously. The solution is made possible in a constraint-based analysis
(Prince and Smolensky 1993). For illustration, consider the analysis
of the string CVCVCV, which represents words like Canada, banana, Sicily,
committee, and so forth. Assuming the constraints discussed earlier, possible
syllabifications and foot structures of this string are shown in (27), where
Main refers to the requirement for main stress to fall on a syllabic foot.

(27) Possible analyses of CVCVCV: many good solutions and many bad ones

CVCVCV                  FtBin  WSP  RMO  Parse2  Main
[CVC][ə][Cə]   (HL)L    ✓      ✓    ✓    ✓       ✓
[Cə][CVC][ə]   L(HL)    ✓      ✓    ✓    ✓       ✓
*[CVC][ə][Cə]  (mm)LL   ✓      ✓    ✓    *       *
*[Cə][CV][Cə]  L(LL)    ✓      *    *    ✓       ✓
*[CV][Cə][Cə]  (LL)L    ✓      *    *    ✓       ✓
*[CV][Cə][Cə]  (L)LL    *      *    *    *       *

Of the six options shown, only two satisfy all the constraints, represented
by Canada for (HL)L and banana for L(HL). The other four analyses violate
one or more of the constraints. It is worth noting that it is of little conse-
quence whether Canada has an underlying form [kænædə], as proposed by
Chomsky and Halle (1968), or whether it is simply [kænədə], as proposed by
Burzio (1996). Similarly, let us consider another string CVCCVV, shown in
(28), where VV is a long vowel or diphthong.

(28) Possible analyses of CVCCVV: many good solutions and many bad ones

CVCCVV                     FtBin  WSP  RMO  Parse2  Main
[CVC][CVV]      (HH)       ✓      ✓    ✓    ✓       ✓
[Cə][CCVV]Ø     L(HL)      ✓      ✓    ✓    ✓       ✓
[CVC][CVV]Ø     (mm)(HL)   ✓      ✓    ✓    ✓       ✓
*[CVC][CVV]     (mm)(mm)   ✓      ✓    ✓    *       *
*[CVCC][VV]     (HH)       ✓      ✓    *    ✓       ✓
*[CV][CCVV]     (LH)       ✓      *    *    ✓       *
*[Cə][CCVV]     L(mm)      ✓      ✓    ✓    ✓       *
*[CVC][CVV]Ø    H(HL)      ✓      *    *    ✓       ✓
*[CVCC][VV]Ø    (mm)(HL)   ✓      ✓    *    ✓       ✓

Of the various options, just three are good, (HH) as in disco, L(HL) as in
supply, and (mm)(HL) as in Bantu. Let us consider why other options are not good.
For disco, the foot structure cannot be (mm)(mm), because (i) the two
syllables have not formed a syllabic foot, hence violating Parse2 at the syl-
lable level, and (ii) main stress is not in a syllabic foot, violating Main Stress.
The foot cannot be (LH) either, because the first syllable has stress, yet it is
L, hence violating the WSP. The syllable structure cannot be [CV][CCVV]
[dɪ][skou], which violates RMO, because no word ends in a stressed [ɪ]. The
syllable structure cannot be [CVCC][VV] [dɪsk][ou], which violates RMO,
because there is no reason for [k] to be in the coda of the first syllable, rather
than in the onset of the second.
For supply, the foot structure cannot be L(mm), because (i) there are two
free syllables, violating Parse2, and (ii) the main stress is not in a syllabic foot,
violating Main Stress. The syllable structure cannot be [CVC][CVV] [səp][lai]
either, because (i) the first syllable is H but has no stress, violating the WSP,
and (ii) there is no reason for [p] to be in the first syllable, a violation of RMO.
Finally, in Bantu, the syllable structure cannot be [CVCC][VV] [bænt][u:],
which violates RMO, because there is no reason to include [t] in the first
syllable. In addition, the stress pattern cannot be H(HL), where the initial H has
no stress, violating WSP.
We have seen then that syllabification, foot structure, and stress can be
evaluated simultaneously. In addition, in a non-​deterministic approach, there
are many ways to be well formed (i.e., to satisfy the constraints of grammar),
while there are many ways to be ill-​formed as well (i.e., to violate one or more
constraints). Therefore, the proposed analysis has explicit predictive power.
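
Simultaneous evaluation of this kind lends itself to a simple filter over candidate footings. The Python sketch below checks four of the constraints in (21) against the six candidates of (27). Several points are assumptions of the sketch rather than claims of the analysis: RMO is left out because it depends on English phonotactics; Parse2 is read at the syllable level, so that any syllable outside a two-syllable foot, including a lone (mm), counts as free; and the formalization has been checked only against the rows of (27). The names `evaluate`, `is_syllabic`, and `TOKEN` are hypothetical.

```python
import re

# Tokens: moraic foot (mm), degenerate foot (L), two-syllable foot, or free syllable.
TOKEN = re.compile(r"\(mm\)|\(L\)|\([HL]{2}\)|[HL]")


def is_syllabic(token: str) -> bool:
    """True for a two-syllable foot such as (HL), (HH), (LL), or (LH)."""
    return token.startswith("(") and token not in ("(mm)", "(L)")


def evaluate(feet: str) -> dict:
    tokens = TOKEN.findall(feet)
    syllabic = [is_syllabic(t) for t in tokens]
    return {
        # FtBin: every foot has two beats, so (L) is excluded
        "FtBin": "(L)" not in tokens,
        # WSP: no stressed L (foot-initial L or (L)) and no unstressed free H
        "WSP": not any(
            t in ("(L)", "H") or (is_syllabic(t) and t[1] == "L") for t in tokens
        ),
        # Parse2: no two adjacent syllables outside a syllabic foot
        "Parse2": not any(not a and not b for a, b in zip(syllabic, syllabic[1:])),
        # Main Stress: there must be a syllabic foot to carry main stress
        "Main": any(syllabic),
    }


candidates = ["(HL)L", "L(HL)", "(mm)LL", "L(LL)", "(LL)L", "(L)LL"]
good = [c for c in candidates if all(evaluate(c).values())]
print(good)
```

Only (HL)L and L(HL) pass all four checks, the structures represented by Canada and banana in (27).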

2.5 Conclusions

I have shown that Max Onset, a widely used rule for syllabification, satisfies
the Law of Initials (LOI) but violates the Law of Finals (LOF). In contrast,
the Revised Max Onset (RMO) satisfies both. In addition, Max Onset
creates stressed light syllables and violates the Weight-Stress Principle (WSP),
whereas RMO does not.
I have shown, too, that Max Onset is the only option in a derivational
approach to phonology (e.g., Halle and Vergnaud 1987), where a word under-
goes a set of ordered rules, first those for syllabification and then those for
stress assignment. In contrast, in a constraint-​based approach to phonology,
where syllabification and stress assignment can be evaluated simultaneously,
RMO becomes possible.
I have also compared two approaches to stress assignment. In the deter-
ministic approach (e.g., Halle and Vergnaud 1987; Hayes 1995; Hammond
1999), some words are thought to be regular and others exceptional. In con-
trast, in the non-​deterministic approach, all words are regular and no word
is exceptional. The non-​deterministic approach is achieved by keeping the
constraints that are observable by all words, and leaving out the constraints
that are violated by ‘exceptional’ words. For example, in the deterministic
approach, there is a requirement to skip the final syllable, which is observed
by Canada but violated by Japan. In the non-deterministic approach, there is
no such requirement, and a word form can choose to skip the final syllable,
as Canada does, or keep it, as Japan does. Both approaches agree that English
word stress is not completely predictable and lexical markings are required.
In the deterministic approach, the markings indicate which words are regular
and which exceptional. In the non-​deterministic approach, the markings indi-
cate which way a word chooses to satisfy the constraints.
The present analysis shows that some phonological constraints are much
stronger than previously thought. For example, RMO ensures that every
stressed syllable is heavy, which supports the second part of the WSP. That
is, not only must heavy syllables be stressed (a point Prince 1992 argues for),
but also light syllables must be unstressed (a point Prince 1992 believes to
be frequently violated). Similarly, in the deterministic approach, where the
final syllable is skipped, Canada has a binary foot, but banana does not. In
the present approach, both Canada and banana have a binary foot, and so do
all other words. Thus, contrary to a central claim in Optimality Theory that
all constraints are in principle violable (Prince and Smolensky 1993), some
constraints do not seem to be so. The present study intends to show that such
constraints merit greater attention than they have received.

References

Bailey, C.-J. N. (1978) Gradience in English syllabification and a revised concept of
unmarked syllabification. Bloomington: Indiana University Linguistics Club.
Burzio, L. (1994) Principles of English stress. Cambridge: Cambridge University
Press.
Burzio, L. (1996) “Surface constraints versus underlying representation” in Durand,
J., and Laks, B. (eds.) Current trends in phonology: Models and methods, vol. 1.
Salford: European Studies Research Institute, University of Salford Publications,
pp. 123–​141.
Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York: Harper
and Row.
Duanmu, S. (2007) The phonology of standard Chinese, 2nd edition. Oxford: Oxford
University Press.
Eddington, D., Treiman, R., and Elzinga, D. (2013) “Syllabification of American
English: Evidence from a large-​scale experiment Part I”, Journal of Quantitative
Linguistics, 20(1), pp. 45–​67.
Halle, M., and Vergnaud, J.-​R. (1987) An essay on stress. Cambridge, MA: MIT Press.
Hammond, M. (1999) The phonology of English: A prosodic optimality theoretic
approach. Oxford: Oxford University Press.
Hayes, B. (1995) Metrical stress theory: Principles and case studies. Chicago: University
of Chicago Press.
Hoard, J. E. (1971) “Aspiration, tenseness, and syllabification in English”, Language,
47(1), pp. 133–​140.
Jespersen, O. (1904) Lehrbuch der Phonetik. Leipzig; Berlin: Teubner.
Kahn, D. (1976) Syllable-​based generalizations in English phonology. Ph.D. Diss.,
Massachusetts Institute of Technology.
Kessler, B., and Treiman, R. (1997) “Syllable structure and the distribution of phonemes
in English syllables”, Journal of Memory and Language, 37(3), pp. 295–​311.
Krakow, R. A. (1989) The articulatory organization of syllables: A kinematic analysis
of labial and velar gestures. Ph.D. Diss., Yale University.
Krakow, R. A. (1999) “Physiological organization of syllables: A review”, Journal of
Phonetics, 27(1), pp. 23–​54.
Liberman, M., and Prince, A. (1977) “On stress and linguistic rhythm”, Linguistic
Inquiry, 8(2), pp. 249–​336.
Lowenstamm, J. (1981) “On the maximal cluster approach to syllable structure”,
Linguistic Inquiry, 12(4), pp. 575–​604.
Pater, J. (2000) “Non-​uniformity in English secondary stress: The role of ranked and
lexically-​specific constraints”, Phonology, 17(2), pp. 237–​274.
Price, P., Ostendorf, M., Shattuck-​Hufnagel, S., and Fong, C. (1991) “The use of
prosody in syntactic disambiguation”, Journal of the Acoustical Society of America,
90(6), pp. 2956–​2970.
Prince, A. (1980) “A metrical theory for Estonian quantity”, Linguistic Inquiry, 11(3),
pp. 511–​562.
Prince, A. (1992) “Quantitative consequences of rhythmic organization” in Ziolkowski,
M., Noske, M., and Deaton, K. (eds.) CLS 26, Papers from the 26th Regional
Meeting of the Chicago Linguistic Society, volume 2: The parasession on the syllable
in phonetics and phonology. Chicago: Chicago Linguistic Society, pp. 355–​398.
Prince, A., and Smolensky, P. (1993) Optimality theory: Constraint interaction in gen-
erative grammar. MS Thesis, Rutgers University and University of Colorado.
Pulgram, E. (1970) Syllable, word, nexus, cursus. Janua linguarum Series minor 81. The
Hague: Mouton.
Treiman, R., and Danis, C. (1988) “Syllabification of intervocalic consonants”, Journal
of Memory and Language, 27(1), pp. 87–​104.
Treiman, R., and Zukowski, A. (1990) “Toward an understanding of English syllabifi-
cation”, Journal of Memory & Language, 29(1), pp. 66–​85.
Turk, A. (1994) “Articulatory phonetic clues to syllable affiliation: Gestural
characteristics of bilabial stops” in Keating, P. A. (ed.) Phonological structure and
phonetic form: Papers in laboratory phonology III. Cambridge, UK; New York:
Cambridge University Press, pp. 107–​135.
Vennemann, T. (1988) Preference laws for syllable structure and the explanation of
sound change. Berlin; New York: Mouton de Gruyter.
Wells, J. C. (1990) “Syllabification and allophony” in Ramsaran, S. (ed.) Studies in
the pronunciation of English: A commemorative volume in honour of A.C. Gimson.
London; New York: Routledge, pp. 76–​86.

3 Enclitics and the clitic group consisting of “host+enclitic” in the Fuzhou dialect1
Shuxiang You

3.1 Introduction
The Fuzhou dialect is the representative dialect of the Eastern Min dialect
group of Chinese. Fuzhou has a complex phonological system, and the com-
plexity lies in the fact that sound changes may occur to the initials, finals,2
and tones of all the participating syllables in a string of sounds (cf. Chen and
Norman 1965; Chan 1985; Chen 1998; Li 2002, among others). Before we
proceed to the discussion about enclitics and the clitic group composed of
“host+enclitic” in Fuzhou, let us first go over a brief introduction to Fuzhou
phonological phenomena relevant to the discussion in this chapter.
The first Fuzhou phonological phenomenon examined here is
Phonological Tone Sandhi (henceforth TS). TS stipulates that the citation
tone of a non-​final syllable is changed into a sandhi tone depending on its
original tonal value and that of the tone of the following syllable within a
given domain (cf. Chen and Norman 1965; Chan 1980, 1985; Wright 1983;
Shih 1986; Hung 1987; Zhang 1992; Chan 1998; Chen 1998; You 2017,
among others). It has long been noticed that TS may apply to lexical items,
as in (1), and phrases consisting of independent words, as in (2). Citation
forms are presented on the left of “→”, while sandhi forms are presented
on the right. For the sake of brevity, only sandhi forms of tones (marked in
bold) are presented here.

(1) Application of TS in lexical items

a. 沙发 sa44 xuaʔ23 → sa51 xuaʔ23 ‘sofa’
b. 老鼠 lo31 tshy31 → lo24 tshy31 ‘mouse’

(2) Application of TS in phrases

a. 食 饭                          b. 野 俊
   siɛʔ5 puoŋ242                     ʔia31 tsouŋ213
   → siɛʔ21 puoŋ242                  → ʔia44 tsouŋ213
   eat rice                          very beautiful
   ‘to eat food’                     ‘very beautiful’
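
The tone changes in (1) and (2) can be restated as a lookup from a pair of citation tones to the sandhi tone of the non-final syllable. The Python sketch below is illustrative only: the names `ATTESTED_TS` and `apply_ts` are hypothetical, and the table lists just the four tone pairs attested in (1) and (2), not the full Fuzhou sandhi system. Tone pairs that are not listed are simply left unchanged.

```python
# Sandhi tone of a non-final syllable, keyed by
# (its own citation tone, citation tone of the following syllable).
# Only the pairs attested in examples (1) and (2) are listed (an assumption
# made for illustration; the real system has many more pairs).
ATTESTED_TS = {
    ("44", "23"): "51",    # (1a) sa44 xuaʔ23   -> sa51 xuaʔ23
    ("31", "31"): "24",    # (1b) lo31 tshy31   -> lo24 tshy31
    ("5", "242"): "21",    # (2a) siɛʔ5 puoŋ242 -> siɛʔ21 puoŋ242
    ("31", "213"): "44",   # (2b) ʔia31 tsouŋ213 -> ʔia44 tsouŋ213
}


def apply_ts(domain):
    """Apply TS inside one domain, given as a list of (segments, citation_tone)."""
    output = []
    for index, (segments, tone) in enumerate(domain):
        if index < len(domain) - 1:
            # A non-final syllable is sensitive to the tone of the next syllable.
            next_tone = domain[index + 1][1]
            tone = ATTESTED_TS.get((tone, next_tone), tone)
        # The final syllable of the domain always keeps its citation tone.
        output.append((segments, tone))
    return output


print(apply_ts([("sa", "44"), ("xuaʔ", "23")]))
```

Treating the domain as an explicit argument mirrors the point made in the text: whether TS applies depends on which syllables are grouped into the same domain, not only on the tones themselves.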

Nevertheless, TS does not apply to all types of lexical items and phrases,
as illustrated in (3) and (4). The position where TS fails to apply is marked
with “#”.

(3) Blocking of TS in lexical items3

a. 拍拍 phaʔ23 phaʔ23 → phaʔ21 # phaʔ23 *phaʔ51 phaʔ23 (TS) ‘bat’
b. 袋袋 toy242 toy242 → toy21 # toy242 *toy51 toy242 (TS) ‘bag’

(4) Blocking of TS in phrases4

a. 食 完 b.侬 侈
siɛʔ5 ʔuoŋ51 nøyŋ51 sɛ242
→ siɛʔ5 # ʔuoŋ51 → nøyŋ51 # sɛ242
*siɛʔ31 ʔuoŋ51 *nøyŋ21 sɛ242
eat finish people many
‘to finish eating’ ‘there are many people’

The second phonological phenomenon is Initial Consonant Lenition
(henceforth CL). The initial of a non-first syllable within a domain containing
two or more syllables is usually changed according to the final of the pre-
ceding syllable (cf. Chen and Norman 1965; Chan 1985; Shih 1986; Li et al.
1994; Chen 1998; You 2017, among others). Similar to TS, CL may apply to
lexical items, as in (5), and phrases, as in (6). Sandhi forms of both tones and
initials are presented in the following examples, in which initials in question
are marked in bold.

(5) Application of CL in lexical items

a. 沙发 sa44 xuaʔ23 → sa51 ʔuaʔ23 ‘sofa’
b. 老鼠 lo31 tshy31 → lo24 ʒy31 ‘mouse’

(6) Application of CL in phrases

a. 旧 书 b. 野 好
kou242 tsy44 ʔia31 xo31
→ kou44 ʒy44 → ʔia24 ʔo31
old book very good
‘old book’ ‘very good’

The application of CL is not obligatory in all types of lexical items and
phrases either, as in (7) and (8). The position where CL is blocked is marked
with “#”.

(7) Blocking of CL in lexical items5

a. 杯杯 pui44 pui44 → pui31 # pui44 *pui31 βui44 ‘cup’
b. 桶桶 thøyŋ31 thøyŋ31 → thøyŋ31 # thøyŋ31 *thøyŋ31 løyŋ31 ‘bucket’

(8) Blocking of CL in phrases6

a. 食 饱                                b. 买 锅
   siɛʔ5 pa31                              mɛ31 kuo44
   → siɛʔ5 # pa31                          → mɛ21 # kuo44
   *siɛʔ5 βa31                             *mɛ21
   eat full                                buy pan
   ‘to have eaten to one’s fill’           ‘to buy a pan’

From the examples in (1–8), we can see that in the domains formed by
lexical items and phrases in the Fuzhou dialect, both TS and CL apply in
some strings of sounds, while they are blocked in others. As we will see in
the following sections, the domain formed by the clitic group consisting of
“host+enclitic” is quite different from the domains formed by lexical items
and phrases, in terms of the application/​blocking of TS and CL.
The following sections are organized as follows. Section 3.2 presents an
introduction to clitics in general and the clitic group in prosodic phonology.
Section 3.3 identifies Fuzhou enclitics and explores their morphosyntactic
functions. Section 3.4 examines the phonological properties of the clitic
group composed of “host+enclitic” in Fuzhou with respect to the application/​
blocking of TS and CL. Section 3.5 discusses the violation of the Strict Layer
Hypothesis (SLH) caused by the clitic group consisting of “host+enclitic” in
Fuzhou. Section 3.6 concludes this study.

3.2 Clitics and the clitic group cross-​linguistically

3.2.1 Properties of clitics cross-​linguistically

Many languages contain a specific type of element, often referred to as
clitics, which are phonologically dependent and have to “lean on” the adja-
cent host. Depending on their position in relation to the host they attach to,
clitics are mainly divided into proclitics and enclitics. A clitic preceding its
host is called proclitic, while a clitic appearing after its host is called enclitic.
Fuzhou clitics examined in this chapter all attach to the right of their hosts
and are thus enclitics,7 as will be seen in Section 3.3. It has long been recognized
that this specific type of element “exhibits some of the properties of the word
and some of the properties of the affix” (Klavans 1982; also cf. Zwicky 1977;
Crystal 2008, among others).
The mixed behavior and unclear linguistic status of clitics have posed
problems for linguists. Starting with Zwicky’s (1977) pioneering study, a vast
amount of research has been devoted to identifying the properties of clitics
cross-​linguistically. Many linguists argue that clitics represent an independent
category due to their morphosyntactic and phonological properties and should
be distinguished from both words and affixes (Hayes 1984/​1989; Nespor and
Vogel 1986, henceforth N&V 1986; Haspelmath and Sims 2010, among others).

On the one hand, clitics should be distinguished from independent words
in several ways. First, clitics exhibit a type of phonological dependency, while
independent words are free in terms of their occurrence. Due to the phono-
logical dependency of clitics, it is impossible to (a) pause between a clitic and
its host, (b) assign stress to clitics in stress languages, (c) assign contrastive
stress to clitics, and (d) freely move clitics in an utterance (Haspelmath and
Sims 2010). The second property distinguishing clitics from independent
words is that clitics commonly belong to some functional and considerably
closed categories, including auxiliaries, pronouns, determiners, prepositions,
postpositions, conjunctions, and functional particles like negatives and inter-
rogative particles (Zwicky 1977; Klavans 1982). By contrast, independent
words typically come from open categories such as nouns, verbs (excluding
auxiliaries), and adjectives.
On the other hand, several criteria have been proposed to distinguish
clitics from affixes (Zwicky and Pullum 1983; Haspelmath and Sims 2010,
among others): (a) clitics can attach to words of virtually any category,
while affixes are quite specific in their selections of stems; (b) clitics do
not exhibit arbitrary gaps, while affixes do; (c) clitics do not exhibit mor-
phophonological idiosyncrasies, while irregular forms are quite common
in groupings of stems and affixes; (d) the meaning of the string of the host
plus the clitic(s) is predictable from the meaning of the host and that of the
clitic(s), while affix-​stem combinations may have an idiosyncratic meaning;
(e) an affixed word is regularly treated as one unit by syntactic operations,
while a string of the host plus the clitic(s) is usually treated as two (or more
than two) separate units by syntactic operations; and (f) clitics can attach
to material already containing clitics or affixes, but affixes cannot attach to
a host containing clitics.
Furthermore, as argued by a number of linguists (e.g., Hayes 1984/​1989;
N&V 1986, among others), the phonological behavior of clitics is often
different from that of both independent words and affixes. Specifically, in a
given language, some phonological phenomena apply only in relation to a
constituent consisting of a host plus the clitic, namely the clitic group. Hence,
the role played by the clitic group as the domain of application for various
phonological generalizations can serve as another important criterion to dis-
tinguish clitics from both independent words and affixes.

3.2.2 The clitic group in prosodic phonology

Basic premises and assumptions of prosodic phonology

Prosodic phonology, as developed in Selkirk (1978/​1981, 1986), N&V (1986),
and other pioneering works (Booij 1983, 1985; Hayes 1984/​1989, among
others), over 30 years ago, stands as a representative phonological theory
of the interactions between phonology and other components of the
grammar. Within the model of prosodic phonology, there exists
a hierarchically arranged organization called Prosodic Structure between the
morphosyntactic and phonological components. A given string of sounds is
organized into a series of hierarchically arranged prosodic constituents, with
each prosodic constituent serving as the domain of application for specific
phonological rules and phonetic processes. Thus phonological operations do
not refer to syntactic constituents in a direct way but instead to the already
created prosodic constituents. Hence, the existence of phonological rules and
phonetic processes that make reference to a particular prosodic constituent
is viewed as one significant motivation for the establishment of the prosodic
constituent itself in a given language.
The earliest prosodic hierarchy proposed by Selkirk (1978/​1981) contains
only the syllable, the foot, the phonological/​prosodic word,8 the phonological
phrase, the intonational phrase, and the utterance. Hayes (1984/​1989) and
N&V (1986) added and inserted the clitic group between the phonological
word and the phonological phrase, and Zec (1988) proposed the mora (μ),
the lowest constituent in the hierarchy. Prosodic constituents are defined by
making use of different types of phonological and non-​phonological infor-
mation. According to the types of information to which different constituents
are sensitive, Zhang (1992, 2017) proposed a trisected model for prosodic
hierarchy, as given in Figure 3.1.
Figure 3.1 Prosodic hierarchies (Source: Zhang 1992, 2017)

Semantic & pragmatic information (discourse/focus-based hierarchy): Utterance (Utt), Intonational Phrase (IPh)
Morphosyntactic information (morpho-syntax-based hierarchy): Phonological Phrase (PPh), Clitic Group (CG), Phonological Word (PW)
Phonological information (rhythm-based hierarchy): Foot, Syllable, Mora

The only well-formedness condition on prosodic constituency is laid down
in the SLH, formulated in Selkirk (1984), which stipulates that in the prosodic
hierarchy, a prosodic constituent of a given level n immediately dominates
only constituents of the lower level n-1, and is exhaustively contained in a
constituent of the immediately higher level n+1. In response to evidence
and criticisms that have challenged the SLH, Selkirk (1996) factored the SLH
into four more primitive constraints within the framework of Optimality
Theory, as given in (9), among which Layeredness and Headedness
are universally inviolable, while Exhaustivity and Non-recursivity are not
observed by all languages.

(9) Constraints on prosodic domination

(where Cn = some prosodic category)
a. Layeredness: No Ci dominates a Cj, j > i,
e.g., “No syllable dominates a foot”.
b. Headedness: Any Ci must dominate a Ci−1 (except if Ci = syllable),
e.g., “A phonological word must dominate a foot”.
c. Exhaustivity: No Ci immediately dominates a constituent Cj, j < i –​1,
e.g., “No phonological word immediately dominates a syllable”.
d. Non-​recursivity: No Ci dominates Cj, j = i,
e.g., “No foot dominates a foot”.
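
The four constraints in (9) can be read as checks on a tree of prosodic constituents. The following Python sketch is a hedged illustration, not part of Selkirk's (1996) proposal: the tuple encoding of trees, the names `LEVELS`, `dominates_rank`, and `violations`, and the toy trees are all assumptions made here. Layeredness, Exhaustivity, and Non-recursivity are checked on immediate domination only, while Headedness is checked on domination at any depth.

```python
# Prosodic categories from lowest to highest; the syllable is treated as the
# bottom level, following the exception clause in (9b).
LEVELS = ["syllable", "foot", "word", "clitic_group",
          "phonological_phrase", "intonational_phrase", "utterance"]
RANK = {name: rank for rank, name in enumerate(LEVELS)}


def dominates_rank(node, rank):
    """True if node dominates, at any depth, a constituent of the given rank."""
    _, children = node
    return any(RANK[child[0]] == rank or dominates_rank(child, rank)
               for child in children)


def violations(node):
    """Return the names of the constraints in (9) violated in the tree.

    A node is a pair (category, list_of_children); syllables have no children.
    """
    category, children = node
    i = RANK[category]
    found = set()
    # Headedness: any C^i other than the syllable must dominate a C^(i-1).
    if category != "syllable" and not dominates_rank(node, i - 1):
        found.add("Headedness")
    for child in children:
        j = RANK[child[0]]
        if j > i:
            found.add("Layeredness")        # a lower category dominates a higher one
        elif j == i:
            found.add("Non-recursivity")    # a category dominates its own kind
        elif j < i - 1:
            found.add("Exhaustivity")       # a level is skipped
        found |= violations(child)
    return found


syl = ("syllable", [])
strict = ("word", [("foot", [syl, syl])])           # strictly layered
skipping = ("word", [("foot", [syl, syl]), syl])    # word directly over a syllable
recursive = ("foot", [("foot", [syl, syl]), syl])   # foot inside a foot

print(violations(strict), violations(skipping), violations(recursive))
```

The toy trees correspond to the illustrations given in (9c) and (9d): a phonological word that immediately dominates a syllable, and a foot that dominates a foot.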

Many examples of the violation of Exhaustivity and Non-recursivity have
been found across languages (e.g., Ladd 1986; Hyman et al. 1987; Odden
1987; Inkelas 1989/​1990; Itô and Mester 1992/​2003; Zhang 1992, 2014, 2017;
Prince and Smolensky 1993; Truckenbrodt 1995, 1999; Vogel 2009, among
others). In addition, it has been noticed that Layeredness is not inviolable
either (Zhang 1992, 2017). On the basis of evidence from Chinese dialects
such as Chongming and Pingyao, Zhang suggests that in the trisected model
for prosodic hierarchy in Figure 3.1, there is no violation of the SLH among
prosodic units within different hierarchies, while violation of the SLH may
happen among prosodic units in the same hierarchy on a language-​specific
basis (see Zhang 1992, 2017, for more details).

Definition of the clitic group and evidence for the clitic group domain

Based on the observation that certain phonological generalizations only apply
within the domain consisting of a word host and the clitic(s) in languages, the
string of the host plus the clitic(s) is treated as a unique prosodic constituent
in the prosodic hierarchy. This constituent is referred to as the clitic group, as
defined in (10).

(10) Clitic group (CG) formation (N&V 1986)

The domain of the CG consists of a ω containing an independent (i.e.,
non-​clitic) word plus any adjacent ωs containing
a. a directional clitic, or
b. a plain clitic/​non-​directional clitic such that there is no possible host
with which it shares more category memberships.9

Like other prosodic constituents, the clitic group has been reported to form
the domain for many phonological phenomena cross-​linguistically, which
constitutes the most substantial evidence for the existence of this constituent.
A typical case is Stress Assignment in Latin. According to N&V (1986), the
clitic group is a domain for this rule. Specifically, when an enclitic is attached
to a word, the primary stress is shifted from its original position within the
word to the syllable that immediately precedes the clitic, as exemplified in (11),
in which -​que ‘and’, interrogative -​ne, and -​cum ‘with’ are all enclitics.

(11) a. vírum ‘the man (acc.)’ virúmque ‘and the man (acc.)’
b. vídēs ‘you see’ vidḗsne? ‘do you see?’
c. cum vóbis ‘with you (pl.)’ vobíscum ‘with you (pl.)’
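
The stress shift illustrated in (11) can be stated procedurally: within the clitic group, primary stress falls on the host syllable that immediately precedes the enclitic. The Python sketch below is a simplified illustration under stated assumptions: the function name `stress_clitic_group`, the use of "ˈ" as a stress mark, and the syllable divisions passed in by hand are not taken from N&V (1986).

```python
def stress_clitic_group(host_syllables, enclitic):
    """Mark primary stress (ˈ) on the syllable immediately preceding the enclitic."""
    if not host_syllables:
        raise ValueError("the host needs at least one syllable")
    unstressed = host_syllables[:-1]
    # The last syllable of the host is the one right before the enclitic.
    stressed = "ˈ" + host_syllables[-1]
    return unstressed + [stressed, enclitic]


# (11a) virum + -que, with the syllable division vi.rum assumed for illustration
print(".".join(stress_clitic_group(["vi", "rum"], "que")))   # vi.ˈrum.que
```

The rule refers only to the boundary between host and enclitic, which is why it is taken to need the clitic group, rather than the word, as its domain.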

There are many other phonological phenomena applying within clitic
groups, but not across their boundaries or in other prosodic domains cross-
linguistically, such as v-​Deletion and s, z-​Palatalization in English; Stress
Readjustment, Nasal Deletion, Nasal Assimilation, and Stop Voicing in
Greek; Stress Assignment and Vowel Harmony in Turkish; and t-​Deletion
in Catalan (see Hayes 1984/1989; N&V 1986; Kabak and Vogel 2001, among
others).
However, arguments against the existence of the clitic group as a prosodic
domain have been advanced, for example, (a) clitics may attach to constituents
higher than the prosodic word; (b) there is a lack of evidence for the clitic
group as a domain in some languages; and (c) clitics have to be given the
prosodic word status according to the definition presented in (10) to satisfy
the SLH (see Inkelas 1989/​1990; Inkelas and Zec 1995; Booij 1996; Selkirk
1996; Peperkamp 1997, among others). In response to the objections and
problems with the original clitic group, Vogel (2009) argues that the problem
is not due to the clitic group itself but rather results from the SLH, and should
be resolved by assuming a slight weakening of the SLH.
As shown in the following sections, Fuzhou enclitics may attach to
constituents higher than the prosodic word,10 which would be viewed as
counterevidence against the existence of the clitic group according to some
previous studies mentioned in the last paragraph. Nonetheless, following
previous studies that argue for the clitic group, I assume in this study that
the clitic group is still part of the prosodic hierarchy. Moreover, I adopt the
weakened SLH and argue that the violation of Non-​recursivity is allowed in
Fuzhou, which is well supported by the evidence from this dialect.11

3.3 Enclitics in the Fuzhou dialect and their morphosyntactic functions

There are a number of enclitic-​like elements in Fuzhou. Although they have
very distinctive morphosyntactic and phonological behavior, these elements
are usually treated as suffixes or words in the literature (Chen and Norman
1965; Chan 1985; Chen 1998; Li and Liang 2001; Li 2002, among others).
Section 3.3 and Section 3.4 offer a comprehensive description and analysis of
the properties of these elements. These elements, as will be shown in Section 3.3
and Section 3.4, all belong to closed functional categories and share common
properties of enclitics across languages. As bound morphemes, they all have
to attach to adjacent prosodic units on their left, and are therefore actually
enclitics.

3.3.1 Possessive/​modificational/​nominalization marker 其 [ki0]

The most commonly used Fuzhou enclitic 其 [ki0] serves as the possessive/​
modificational/​nominalization marker. It is presented as [i0] in some previous
studies (e.g., Chen and Norman 1965; Chan 1985). However, within a sandhi
environment, when preceded by a syllable ending with the historical *-​k coda,
其 is pronounced with a stop initial k-​. According to the CL rule, citation
initials following the *-​k coda remain unchanged in a sandhi environment. If
the citation form of 其 has a zero initial, it should not have the stop initial k-​
when following the *-​k coda in a sandhi context. Therefore, the citation initial
of 其 must be the stop consonant k-​instead of the zero initial.
Like its counterpart de 的 in Mandarin Chinese, 其 [ki0] in Fuzhou can
be: (a) attached to the right of a noun/​pronoun to indicate possession, as in
(12); (b) attached to the modifier, connecting the modifier and the nominal
expression modified by the modifier, as in (13); and (c) used to make nouns
out of verbs/​verb phrases, adjectives, nouns/​noun phrases, or pronouns, as in
(14) (cf. Chen and Norman 1965; Li 2002, among others). 其 [ki0] is presented
as POSS (= possessive marker), MOD (= modificational marker), or NOM
(= nominalization marker) in the gloss. In the examples, the enclitics are
labeled with “C” and the prosodic words are labeled with “ω”. For the sake
of brevity, examples in Section 3.3 present only the citation/​underlying seg-
mental structure.

(12) Possessive marker

a. [[我]ω 其C]CG 书 b. [[依妹]ω 其C]CG 衣裳12
[[ŋuai] ki] tsy [[ʔi mui] ki] ʔi suoŋ
I POSS book younger sister POSS clothes
‘my book’ ‘younger sister’s clothes’

(13) Modificational marker

a. [[旧]ω 其C]CG 书 b. [[旧旧]ω 其C]CG 书
[[kou] ki] tsy [[kou kou] ki] tsy
old MOD book old old MOD book
‘old book’ ‘(very) old book’

(14) Nominalization marker

a. [[食]ω 其C]CG b. [[红]ω 其C]CG
[[siɛʔ] ki] [[ʔøyŋ] ki]
eat NOM red NOM
‘food’ ‘red thing(s)’

c. [[我]ω 其C]CG
[[ŋuai] ki]

3.3.2 Adjective reduplication markers 势 [siɛ213], 式 [seiʔ23], and 喏 [luoʔ23]
Reduplicated adjectives in Fuzhou generally cannot be used as the predicate
on their own. When used as the predicate, they are bound on the right side and
thus need to take enclitics 势 [siɛ213], 式 [seiʔ23], or 喏 [luoʔ23] (cf. Chen 1998;
Li and Liang 2001; Li 2002, among others). Enclitics 势 [siɛ213], 式 [seiʔ23], and
喏 [luoʔ23] are freely interchangeable when attached to reduplicated adjectives.
Examples are given below, where these three enclitics are presented as AdjR
(= adjective reduplication marker) in the gloss.

(15) a. [[白白]ω 势/​式/​喏C]CG

[[paʔ paʔ] siɛ/​seiʔ/​luoʔ]
white white AdjR
‘rather white’
b. [[闲闲落落]ω 势/​式/​喏C]CG
[[ʔeiŋ ʔeiŋ loʔ loʔ] siɛ/​seiʔ/​luoʔ]
easy AdjR
‘very easy’

3.3.3 Aspect markers

It has long been recognized that Fuzhou has a number of aspect markers
occurring after the verb/​verb phrase (cf. Chen and Norman 1965; Chan 1985;
Chen 1998; Li and Liang 2001; Li 2002, among others). These post-​verbal
aspect markers are enclitics, which attach to the host on their left to indicate
the developmental status of the event or situation.

Durative aspect marker 𠲥 [lɛ0]

𠲥 [lɛ0] is a versatile enclitic in Fuzhou, not only serving as the durative aspect
marker but also as the perfective aspect marker, the post-​verbal particle, and
the locative marker (to be discussed in more detail in relevant subsections). As
the durative aspect marker, 𠲥 [lɛ0] behaves like the durative aspect marker zhe
着 in Mandarin Chinese, occurring in the post-​verbal position to indicate a
continuing state or situation denoted by the verb. Verbs preceding 𠲥 [lɛ0] are
usually those denoting states or actions that can last for a certain amount of
time, as exemplified in (16).

(16) a. 门 [[关]ω 𠲥C]CG b. 伊 [[徛]ω 𠲥C]CG

mouŋ [[kuoŋ] lɛ] ʔi [[khiɛ] lɛ]
door close DUR he stand DUR
‘The door is closed.’ ‘He is standing.’

In addition, similar to zhe 着, the durative aspect marker 𠲥 [lɛ0] can occur
between two verbs. In the “V1 𠲥 V2” construction, 𠲥 [lɛ0] attaches to the pre-
ceding verb (V1) and indicates that the event denoted by the following verb
(V2) happens in the state of “V1-​ing”, as in (17). Moreover, 𠲥 [lɛ0] can be used
in an imperative sentence, as in (18).

(17) [[徛]ω 𠲥C]CG 等 (18) [[徛]ω 𠲥C]CG

[[khiɛ] lɛ] tiŋ [[khiɛ] lɛ]
stand DUR wait stand DUR
‘to wait standing’ ‘Stand (there)!’

Experiential aspect markers 过 [kuo213] and 着 [tuoʔ5]

Fuzhou enclitics 过 [kuo213] and 着 [tuoʔ5] are both experiential aspect
markers, whose morphosyntactic function is similar to that of the Mandarin
experiential aspect marker guo 过, indicating the past experience of the event
or action denoted by the preceding verb. According to Li and Liang (2001), 过
[kuo213] and 着 [tuoʔ5] are interchangeable in the Fuzhou dialect, and the only
difference between these two is that 着 [tuoʔ5] is more often used by the older
generations, while 过 [kuo213] is more often used by the younger generations.
Examples of 过 [kuo213] and 着 [tuoʔ5] are presented as follows.

(19) a. [[去]ω 过/​着C]CG 天津 b. [[食]ω 过/​着C]CG 鱼

[[kho] kuo/​tuoʔ] thiɛŋ kiŋ [[siɛʔ] kuo/​tuoʔ] ŋy
go EXP Tianjin eat EXP fish
‘to have been to Tianjin before’ ‘to have eaten fish before’

Perfective aspect marker 𠲥 [lɛ0]

There are two perfective aspect markers in Fuzhou, 𠲥 [lɛ0] and 去 [kho0] (cf.
Chen 1998; Feng 1998, among others). Similar to the perfective aspect marker
le 了 in Mandarin Chinese, both 𠲥 [lɛ0] and 去 [kho0] attach to the preceding
verb/​verb phrase and indicate the completion of actions. Their morphosyn-
tactic distributions, nevertheless, are not the same. 𠲥 [lɛ0] occurs after the
verb and is followed by other elements such as the object or the complement.13
In contrast, 去 [kho0] appears after a bare verb or a verb-complement
structure (cf. Chen 1998; see the subsection on the perfective aspect marker
去 [kho0] below for more details). Examples of the
perfective aspect marker 𠲥 [lɛ0] are presented in (20).

(20) a. 伊 [[食]ω 𠲥C]CG 暝 去 睏

ʔi [[siɛʔ] lɛ] maŋ kho khouŋ
he eat PERF dinner go sleep
‘He went to sleep after eating the dinner.’
b. [[睏]ω 𠲥C]CG 大 半 日
[[khouŋ] lɛ] tuai puaŋ niʔ
sleep PERF big half day
‘slept for most of the day’

Perfective aspect marker 去 [kho0]

The other Fuzhou perfective aspect marker 去 [kho0] occurs right after a
bare verb or a verb-​complement structure, indicating the completion of the
action. If the verb originally takes an object that has to be mentioned in the
sentence, the object should be advanced to the topic position (Chen 1998).
According to Li (2002), this perfective aspect marker usually indicates an
unfavorable result of the action denoted by the verb/​verb-​complement struc-
ture, as in (21).

(21) a. 我 [[病]ω 去C]CG

ŋuai [[paŋ] kho]
I sick PERF
‘I am sick.’
b. 水缸 碰 [[必]ω 去C]CG
tsui kouŋ phouŋ [[peiʔ] kho]
water jar hit crack PERF
‘The water jar was hit and developed a crack.’

Sentence final particle 了 [lau31]

The sentence final particle 了 [lau31] in Fuzhou occurs at the end of a sen-
tence or a clause, indicating a change in the state or situation. Thus it by
and large corresponds to the Mandarin sentence final particle le 了, which
is considered as a perfect aspect marker indicating a change of state or a
currently relevant state (CRS) (Li and Thompson 1981; Sun 2006). 了 [lau31]
can be used as the only aspect marker in a sentence/​clause, as in (22). It can
also coexist with other aspect markers discussed in previous subsections, as
in (23) (Chen 1998). Note that the violation of Non-​recursivity is allowed in
examples in (23).

(22) a. 逿 [[雨]ω 了C]CG b. 暝 [[好]ω 了C]CG

touŋ [[ʔy] lau] maŋ [[xo] lau]
fall rain CRS dinner good CRS
‘It is raining.’ ‘The dinner is ready.’

(23) a. 门 [[[开]ω 𠲥C]CG 了C]CG

mouŋ [[[khui] lɛ] lau]
door open DUR CRS
‘The door is already open.’
b. 我 [[[看]ω 过C]CG 了C]CG
ŋuai [[[khaŋ] kuo] lau]
‘I have seen (that).’
c. 天 [[[暗]ω 去C]CG 了C]CG
thiɛŋ [[[ʔaŋ] kho] lau]
sky dark PERF CRS
‘The sky has become dark.’

Delimitative aspect marker 囇 [la242]

The enclitic 囇 [la242] in Fuzhou is used as the delimitative aspect marker. It
occurs on the right of the verb, indicating that a situation or event lasts only a
short time (Chen 1998; Li and Liang 2001; Li 2002, among others), as exem-
plified in (24).

(24) a. [[坐]ω 囇C]CG b. [[听]ω 囇C]CG

[[soy] la] [[thiaŋ] la]
sit DLM listen DLM
‘to sit awhile’ ‘to listen awhile’

3.3.4 Interrogative particles 无 [mo51], 未 [mui242], and 𣍐 [ma242]

There are three negative particles in Fuzhou that can be used as sentence-​final
interrogative particles, namely 无 [mo51], 未 [mui242], and 𣍐 [ma242]. As nega-
tive particles, they occur before the verb or verb phrase, with 无 [mo51] neg-
ating general actions or events, 未 [mui242] negating actions or events that have
occurred, and 𣍐 [ma242] negating the ability or possibility of doing something.
When placed at the end of questions, they are used with different functions,
which basically correspond to their functions as negatives, as exemplified in
(25). Interrogative particles are presented as “Qu” in the gloss.

(25) a. 伊 有 买 [[卵糕]ω 无C]CG ?

ʔi ʔou mɛ [[louŋ ko] mo]
he have buy cake Qu
‘Has he bought a cake?’
b. 伊 买 [[卵糕]ω 未C]CG ?
ʔi mɛ [[louŋ ko] mui]
he buy cake Qu
‘Did he buy a cake?’

c. 伊 会 买 [[卵糕]ω 𣍐C]CG ?
ʔi ʔa mɛ [[louŋ ko] ma]
he will buy cake Qu
‘Will he buy a cake?’

3.3.5 Post-​verbal particles

In addition to the aspect markers discussed in Section 3.3.3, there are three other
enclitics that occur right after the verb in the Fuzhou dialect: 敆 [kaʔ0], 遘
[kau213], and 𠲥 [lɛ0]. These three enclitics do not indicate the developmental
status of the event. In order to distinguish them from the post-​verbal aspect
markers, they are named post-verbal particles (PVP) in this chapter.

Post-verbal particle 敆 [kaʔ0]

Post-​verbal particles 敆 [kaʔ0] and 遘 [kau213] are different from typical
Fuzhou post-​verbal resultative complements such as 完 [ʔuoŋ51] ‘finish’ in
听完 [thiaŋ44 ʔuoŋ51] ‘finish listening’ and 饱 [pa31] ‘full’ in 食饱 [siɛʔ5 pa31] ‘have
eaten to one’s fill’. Resultative complements like 完 [ʔuoŋ51] and 饱 [pa31] never
undergo CL when occurring right after the verb. By contrast, when attached
to the verb, the enclitics 敆 [kaʔ0] and 遘 [kau213] usually undergo CL and have
their sandhi initials (to be discussed in detail in Section 3.4.2).
The enclitic 敆 [kaʔ0] occurs after the verb or verb phrase, introducing the
time and location of the action or event. Its function is thus similar to that
of zai 在 in Mandarin Chinese (cf. Chen 1998; Li and Liang 2001, among
others), as in (26).

(26) a. [[定]ω 敆C]CG 今旦 b. [[排]ω 敆C]CG 厅中

[[tiaŋ] kaʔ] kiŋ taŋ [[pɛ] kaʔ] thiaŋ touŋ
set PVP today put PVP drawing room
‘to be scheduled for today’ ‘to be put in the drawing room’

Post-verbal particle 遘 [kau213]

遘 [kau213] in the Fuzhou dialect corresponds to the Mandarin dao 到 (cf.
Chen 1998, among others). As a post-​verbal particle, 遘 [kau213] has multiple
functions. “V-​遘” can be followed by object nouns/​noun phrases, place words,
time words, and even sentences/​clauses indicating the result/​degree, as exem-
plified in (27).

(27) a. [[收]ω 遘C]CG 批 b. [[行]ω 遘C]CG 厝

[[siu] kau] phiɛ [[kiaŋ] kau] tshuo
receive PVP letter walk PVP home
‘to receive a letter’ ‘to arrive home by walking’

c. [[等]ω 遘C]CG 十 点
[[tiŋ] kau] seiʔ teiŋ
wait PVP ten o’clock
‘to wait until ten o’clock’
d. [[做]ω 遘C]CG 逢侬 都 满意
[[tso] kau] xuŋ nøyŋ tu muaŋ ʔei
do PVP everyone all satisfied
‘to do (something) and make everyone satisfied’

Post-verbal particle 𠲥 [lɛ0]

Unlike the durative aspect marker 𠲥 [lɛ0] and the perfective aspect marker
𠲥 [lɛ0], the post-​verbal particle 𠲥 [lɛ0] does not signify the aspect. Instead,
similar to the Mandarin descriptive complement marker de 得, it connects the
verb and the descriptive complement that indicates the result or manner of
the action, as in (28).

(28) a. [[看]ω 𠲥C]CG 野 清楚

[[khaŋ] lɛ] ʔia tshiŋ tshu
look PVP very clear
‘saw (something) very clearly’
b. [[跳]ω 𠲥C]CG 蜀 身 都 是 汗
[[thiu] lɛ] suoʔ siŋ tu sei kaŋ
jump PVP one body all be sweat
‘jump to be covered in sweat’

3.3.6 Locative marker 𠲥 [lɛ0]

The enclitic 𠲥 [lɛ0] can also serve as the locative marker, changing a regular noun
into a place word. Unlike Fuzhou localizers such as 里 [tiɛ31], 边 [piɛŋ44], and 斗
[tau213], which always cause the preceding syllable to undergo TS, the locative
marker 𠲥 [lɛ0] never triggers the application of TS on the preceding syllable (to
be discussed in detail in Section 3.4.1). Examples of 𠲥 [lɛ0] are presented below,
in which 𠲥 [lɛ0] may have slightly different meanings in different examples.

(29) a. [[面]ω 𠲥C]CG b. [[车]ω 𠲥C]CG

[[meiŋ] lɛ] [[tshia] lɛ]
face LOC car LOC
‘on the face’ ‘in the car’
c. [[书]ω 𠲥C]CG d. [[碗囝]ω 𠲥C]CG
[[tsy] lɛ] [[ʔuaŋ kiaŋ] lɛ]
book LOC small bowl LOC
‘in/​on the book’ ‘in the small bowl’

3.3.7 Recursive clitic group with enclitics

In addition to the examples in (23), there are other cases in which prosodic
recursivity is allowed, as can be seen in (30).

(30) a. 骹 [[[断]ω 了C]CG 其C]CG 侬

kha [[[touŋ] lau] ki] nøyŋ
leg break CRS MOD people
‘people whose legs were broken’
b. [[[收]ω 遘C]CG 其C]CG 批
[[[siu] kau]    ki] phiɛ
receive PVP   MOD letter
‘the letter that was received’

3.3.8 Summary
The data presented in Section 3.3 show that these enclitic-like elements
in the Fuzhou dialect share some of the most common morphosyntactic
properties of enclitics across languages: (a) they all belong to functional
categories; (b) they never occur as the only element of an utterance and must
attach to the adjacent prosodic unit (ω or CG) on the left as the host; (c) the
meaning of the string of the host plus the enclitic is predictable from the
meaning of the host and that of the enclitic; and (d) they can attach to
material already containing an affix, as in (12b) and (29d), or a clitic, as in
(23) and (30). Therefore, according to the discussion in Section 3.2.1, it is
reasonable to consider these elements enclitics. The group of the host plus the
enclitic thus forms a type of clitic group in this dialect. Section 3.4 shows
that certain phonological phenomena are characteristic only of this type
of clitic group in Fuzhou, which provides further evidence for the existence
of enclitics and of the clitic group consisting of “host+enclitic” in this dialect.

3.4 Phonological phenomena and the clitic group consisting of “host+enclitic” in the Fuzhou dialect
This section investigates the phonological behavior of the clitic group
composed of “host+enclitic” in Fuzhou, with respect to the application of
two Fuzhou phonological rules, TS and CL. I will show in this section that
there are phonological phenomena referring crucially to the clitic group
consisting of “host+enclitic” but not to any other context.

3.4.1 TS and the clitic group consisting of “host+enclitic” in the Fuzhou dialect
As mentioned in Section 3.1, the TS rule in Fuzhou is triggered within some
lexical items and phrases while blocked in others. The clitic group consisting
of “host+enclitic” exhibits different behavior in terms of the application of
TS. It has been noticed that some Fuzhou elements never cause the tone of
the preceding syllable to undergo TS (Wright 1983; Chan 1985; Chen 1998; Li
2002, among others). Compare the first two examples in (31). For the sake of
brevity, only sandhi forms of tones are presented in Section 3.4.1.

(31) a. 旧 书 b. 旧 其 书 c. 坐 囇

kou242 tsy44 kou242 ki0 tsy44 soy242 la242
→ kou44 tsy44 → kou242 # ki0 tsy44 → soy242 # la242
old book old MOD book sit DLM
‘old book’ ‘old book’ ‘to sit awhile’

In (31a), TS applies between 旧 ‘old’ and 书 ‘book’ and changes the tone of
旧 ‘old’, whereas TS is blocked in (31b), although the two examples have a
similar morphosyntactic structure, namely the modifier-head structure.
Some linguists suggest that the blocking of TS in cases like (31b) can be
ascribed to the neutral tone carried by elements like 其 [ki0] (e.g., Chan 1985;
Li 2002; among others). Nevertheless, 囇 [la242] in (31c) bears a non-neutral
tone but also blocks TS, showing that the blocking of TS cannot simply be
ascribed to tonal value.
Elements like 其 [ki0] and 囇 [la242] that can trigger the blocking of TS are
enclitics, according to the discussion in Section 3.3. Hence I assume that the
clitic group composed of “host+enclitic” in Fuzhou cannot form the domain of
application for TS. Specifically, TS is blocked between the host and the enclitic.
This assumption is well supported by Fuzhou data, as illustrated in (32–​38).

(32) Blocking of TS in “host+possessive/modificational/nominalization marker 其 [ki0]”
a. [[依妹]ω 其C]CG 衣裳 b. [[红]ω 其C]CG
[[ʔi44 mui213] ki0] ʔi44 suoŋ51 [[ʔøyŋ51] ki0]
→ [[ʔi51 mui213] # ki0] ʔi44 suoŋ51 → [[ʔøyŋ51] # ki0]
younger sister POSS clothes red NOM
‘younger sister’s clothes’ ‘red thing(s)’

(33) Blocking of TS in “host+adjective reduplication marker”

a. [[悬悬]ω 势C/​式C/​喏C]CG
[[keiŋ51 keiŋ51] siɛ213/​seiʔ23/​luoʔ23]
→ [[keiŋ31 keiŋ51] # siɛ213/​seiʔ23/​luoʔ23]
*[[keiŋ31 keiŋ21] siɛ213/​seiʔ23/​luoʔ23]
tall tall AdjR
‘rather tall’
b. [[闲闲落落]ω 势C/​式C/​喏C]CG
[[ʔeiŋ51 ʔeiŋ51 loʔ5 loʔ5] siɛ213/​seiʔ23/​luoʔ23]
→ [[ʔeiŋ21 ʔeiŋ21 loʔ31 loʔ5] # siɛ213/seiʔ23/luoʔ23]
*[[ʔeiŋ21 ʔeiŋ21 loʔ31 loʔ21] siɛ213/seiʔ23/luoʔ23]
easy AdjR
‘very easy’

(34) Blocking of TS in “host+aspect marker”

I. Host+durative aspect marker 𠲥 [lɛ0]
a. 门   [[关]ω        𠲥C]CG
mouŋ51 [[kuoŋ44]    lɛ0]
→ mouŋ51 [[kuoŋ44] # lɛ0]
door  close     DUR
‘The door is closed.’
II. Host+experiential aspect marker 过 [kuo213]/​着 [tuoʔ5]
a. [[去]ω 过C]CG 天津                 b. [[食]ω 着C]CG 鱼
[[kho213] kuo213] thiɛŋ44 kiŋ44       [[siɛʔ5] tuoʔ5] ŋy51
→ [[kho213] # kuo213] thiɛŋ44 kiŋ44   → [[siɛʔ5] # tuoʔ5]
*[[kho51] kuo213] thiɛŋ44 kiŋ44       *[[siɛʔ31] tuoʔ5]
go EXP Tianjin                        eat EXP fish
‘to have been to Tianjin before’      ‘to have eaten fish before’
III. Host+perfective aspect marker 𠲥 [lɛ0]
a. [[睏]ω      𠲥C]CG 大   半      日
[[khouŋ213]    lɛ0]   tuai242   puaŋ213 niʔ5
→ [[khouŋ213] # lɛ0]   tuai21   puaŋ44    niʔ5
sleep     PERF      big    half     day
‘slept more than half of the day’
IV. Host+perfective aspect marker 去 [kho0]
a. 我    [[病]ω     去C]CG
ŋuai31    [[paŋ242]   kho0]
→ ŋuai31    [[paŋ242] #   kho0]
I     sick       PERF
‘I am sick.’
V. Host+sentence final particle 了 [lau31]
a. 逿 [[雨]ω 了C]CG
touŋ242 [[ʔy242] lau31]
→ touŋ51 [[ʔy242] # lau31]
*touŋ51 [[ʔy51] lau31]
fall rain CRS
‘It is raining.’
VI. Host+delimitative aspect marker 囇 [la242]
a. [[坐]ω 囇C]CG
[[soy242] la242]
→ [[soy242] # la242]
*[[soy51] la242]
sit DLM
‘to sit awhile’

(35) Blocking of TS in “host+interrogative particle”

a. 伊 有 [[去]ω 无C]CG?               b. 伊 [[去]ω 未C]CG?
ʔi44 ʔou242 [[kho213] mo51]          ʔi44 [[kho213] mui242]
→ ʔi44 ʔou51 [[kho213] # mo51]       → ʔi44 [[kho213] # mui242]
*ʔi44 ʔou21 [[kho44] mo51]           *ʔi44 [[kho51] mui242]
he have go Qu                        he go Qu
‘Is he going?’                       ‘Has he gone?’
c. 伊 会 [[去]ω 𣍐C]CG?
ʔi44 ʔa242 [[kho213] ma242]
→ ʔi44 ʔa51 [[kho213] # ma242]
*ʔi44 ʔa21 [[kho51] ma242]
he will go Qu
‘Will he go?’

(36) Blocking of TS in “host+post-​verbal particle”

I. Host+post-verbal particle 敆 [kaʔ0]
a. [[定]ω 敆C]CG 今旦
[[tiaŋ242] kaʔ0] kiŋ44 taŋ213
→ [[tiaŋ242] # kaʔ0] kiŋ51 taŋ213
set PVP today
‘to be scheduled for today’
II. Host+post-verbal particle 遘 [kau213]
a. [[收]ω 遘C]CG 批
[[siu44] kau213] phiɛ44
→ [[siu44] # kau213]
*[[siu ] kau ]
receive PVP letter
‘to receive a letter’
III. Host+post-verbal particle 𠲥 [lɛ0]
a. [[看]ω     𠲥C]CG 野   清楚
[[khaŋ213]   lɛ0]    ʔia31    tshiŋ44 tshu31
→ [[khaŋ213] #    lɛ0]    ʔia21    tshiŋ51 tshu31
look        PVP    very     clear
‘saw (something) very clearly’

(37) Blocking of TS in “host+locative marker 𠲥 [lɛ0]”

a. [[面]ω 𠲥C]CG
[[meiŋ213] lɛ0]
→ [[meiŋ213] # lɛ0]
face LOC
‘on the face’

(38) Blocking of TS in recursive clitic group with enclitics

a. 门 [[[开]ω 𠲥C]CG 了C]CG
mouŋ51 [[[khui44] lɛ0] lau31]
→ mouŋ51 [[[khui44] # lɛ0] # lau31]
door open DUR CRS
‘The door is already open.’
b. 我 [[[看]ω 过C]CG 了C]CG
ŋuai31 [[[khaŋ213] kuo213] lau31]
→ ŋuai31 [[[khaŋ213] # kuo213] # lau31]
*ŋuai31 [[[khaŋ21] kuo51] lau31]
‘I have seen (that).’

The examples in (32–38) show that TS consistently fails to apply within the
domain formed by the clitic group composed of “host+enclitic” in the Fuzhou
dialect. Therefore, one distinctive phonological property of this type of
clitic group is the obligatory blocking of TS between the host and the enclitic
CG-internally, which sets it apart from lexical items and phrases.

3.4.2 CL and the clitic group consisting of “host+enclitic” in the Fuzhou dialect
The CL rule, as discussed in Section 3.1, is likewise not obligatory in the
domain formed by lexical items and phrases. In contrast, CL consistently
applies within the domain formed by the clitic group consisting of
“host+enclitic” in Fuzhou. It has been reported that some elements in this
dialect always undergo CL (Chen 1998; Li and Liang 2001; Li 2002, among
others). Compare the two examples in (39):

(39) a. 买 锅                b. 买 过
mɛ31 kuo44               mɛ31 kuo213
→ mɛ21 # kuo44           → mɛ31 ʔuo213
*mɛ21 ʔuo44
buy pan                  buy EXP
‘to buy a pan’           ‘to have bought (something)’

(39a) and (39b) share the same phonological environment, namely, an open
syllable followed by the initial k-. However, CL does not apply in (39a),
where the initial k- of 锅 ‘pan’ remains unchanged, while the experiential
aspect marker 过 in (39b) undergoes CL, with the initial k- changed to the
glottal stop.
Elements like 过 [kuo213] in (39b) are enclitics in Fuzhou, according to the
discussion in Section 3.3. A thorough investigation of Fuzhou data reveals
that the clitic group consisting of “host+enclitic” serves as a domain of
application for CL in Fuzhou, as shown in (40–45).14

(40) Application of CL in “host+possessive/modificational/nominalization marker 其 [ki0]”
a. [[我]ω 其C]CG 书 b. [[红]ω   其C]CG
[[ŋuai31] ki0] tsy44 [[ʔøyŋ51] ki0]
→ [[ŋuai31] ʔi0] tsy44 → [[ʔøyŋ51] ŋi0]
I POSS book red     NOM
‘my book’ ‘red thing(s)’

(41) Application of CL in “host+adjective reduplication marker”

a. [[悬悬]ω 势C/​式C/​喏C]CG
[[keiŋ51 keiŋ51] siɛ213/​seiʔ23/​luoʔ23]
→ [[keiŋ31 keiŋ51] niɛ213/​neiʔ23/​nuoʔ23]
tall tall AdjR
‘rather tall’
b.[[舒舒畅畅]ω 势C/​式C/​喏C]CG
[[tshy44 tshy44 thuoŋ213 thuoŋ213] siɛ213/​seiʔ23/​luoʔ23]
→ [[tshy21 ʒy21 luoŋ51 nuoŋ213] niɛ213/​neiʔ23/​nuoʔ23]
comfortable AdjR
‘very comfortable’

(42) Application of CL in “host+aspect marker”

I. Host+durative aspect marker 𠲥 [lɛ0]
a. 门   [[关]ω    𠲥C]CG
mouŋ51 [[kuoŋ44] lɛ0]
→ mouŋ51 [[kuoŋ44] nɛ0]
door   close       DUR
‘The door is closed.’
II. Host+experiential aspect marker 过 [kuo213]/​着 [tuoʔ5]
a. [[去]ω   过C]CG 天津
[[kho213] kuo213] thiɛŋ44 kiŋ44
→ [[kho213] ʔuo213]  thiɛŋ44 ŋiŋ44
go    EXP    Tianjin
‘to have been to Tianjin before’
b. [[办]ω   着C]CG 护照
[[paiŋ242]  tuoʔ5]  hou242 tsiu213
→ [[paiŋ242]   nuoʔ5]   hou51 ʒiu213
do       EXP    passport
‘to have applied for a passport’

III. Host+perfective aspect marker 𠲥 [lɛ0]

a. [[睏]ω      𠲥C]CG 大      半    日
[[khouŋ213] lɛ0]    tuai242 puaŋ213 niʔ5
→ [[khouŋ213] nɛ0]    tuai21    βuaŋ44    niʔ5
sleep      PERF big    half      day
‘slept more than half of the day’
IV. Host+perfective aspect marker 去 [kho0]
a. 我       [[病]ω   去C]CG
ŋuai31   [[paŋ242] kho0]
→ ŋuai31   [[paŋ242] ŋo0]
I    sick       PERF
‘I am sick.’
V. Host+sentence final particle 了 [lau31]
a. 伊 生        [[囝]ω 了C]CG
ʔi44 saŋ44    [[kiaŋ31] lau31]
→ ʔi44 saŋ51    [[kiaŋ31] nau31]
she give birth child    CRS
‘She has given birth to a child.’
VI. Host+delimitative aspect marker 囇 [la242]
a. [[听]ω    囇C]CG
[[thiaŋ44]    la242]
→ [[thiaŋ44]    na242]
listen       DLM
‘to listen awhile’

(43) Application of CL in “host+post-​verbal particle”

I. Host+post-verbal particle 敆 [kaʔ0]
a. [[定]ω 敆C]CG 今旦
[[tiaŋ242] kaʔ0]    kiŋ44 taŋ213
→ [[tiaŋ242] ŋaʔ0]   kiŋ51 naŋ213
set   PVP     today
‘to be scheduled for today’
II. Host+post-verbal particle 遘 [kau213]
a. [[等]ω 遘C]CG 十     点
[[tiŋ31] kau213]  seiʔ5 teiŋ31
→ [[tiŋ31] ŋau213] seiʔ31   teiŋ31
wait     PVP      ten    o’clock
‘to wait until ten o’clock’
III. Host+post-verbal particle 𠲥 [lɛ0]
a. [[看]ω   𠲥C]CG 野    清楚
[[khaŋ213]  lɛ0]     ʔia31 tshiŋ44 tshu31
→ [[khaŋ213]   nɛ0]    ʔia21 tshiŋ51 ʒu31
look       PVP      very clear
‘saw (something) very clearly’

(44) Application of CL in “host+locative marker 𠲥 [lɛ0]”

a. [[面]ω 𠲥C]CG
[[meiŋ213] lɛ0]
→ [[meiŋ213] nɛ0]
face LOC
‘on the face’

(45) Application of CL in recursive clitic group with enclitics

a. 骹 [[[断]ω 了C]CG 其C]CG 侬
kha44 [[[touŋ31] lau31] ki0] nøyŋ51
→ kha44 [[[touŋ31] nau31] ʔi0] nøyŋ51
leg break CRS MOD people
‘people whose legs were broken’
b. [[[收]ω  遘C]CG 其C]CG 批
[[[siu44] kau213] ki0]      phiɛ44
→ [[[siu44] ʔau213]   ʔi0]       phiɛ44
receive   PVP     MOD   letter
‘the letter that was received’

The empirical evidence presented in (40–45) suggests that CL consistently
applies between the host and the enclitic within the clitic group domain
composed of “host+enclitic” in the Fuzhou dialect. Since CL is not an obligatory
rule in lexical items and phrases, the mandatory application of CL is another
distinctive phonological property of the clitic group domain consisting of
“host+enclitic” in Fuzhou. Because Fuzhou enclitics can never stand alone and
the application of CL is mandatory between the host and the enclitic, the
syllable initial of an enclitic is always determined by the final of the
preceding syllable, showing that Fuzhou enclitics are phonologically dependent.
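
The dependence of the enclitic initial on the preceding final can be made concrete with a minimal sketch. It covers only the alternations attested in (40–45) and is not a full statement of Fuzhou CL. The function names (`apply_cl`, `split_initial`) and the two lookup tables are illustrative assumptions, as is the plain-text transcription used in this chapter's examples, in which "kh" stands for an aspirated stop and trailing digits for tone. Contexts not attested in this section, such as a host ending in a glottal stop, are deliberately left unchanged rather than guessed.

```python
# Illustrative sketch only: the mappings are read off examples (40-45)
# and are NOT a complete account of consonant lenition (CL) in Fuzhou.

# Initials recognised in the flattened transcription, longest first so that
# "kh" is never mistaken for "k" (assumption: "h" marks aspiration).
KNOWN_INITIALS = ("tsh", "kh", "th", "ts", "k", "t", "l", "s", "m")

# Surface initial of the enclitic after a host ending in the nasal ŋ.
AFTER_NASAL = {"k": "ŋ", "kh": "ŋ", "t": "n", "l": "n", "s": "n"}
# Surface initial of the enclitic after a host ending in a vowel.
AFTER_VOWEL = {"k": "ʔ"}


def strip_tone(syllable: str) -> str:
    """Drop trailing tone digits, e.g. 'kuoŋ44' -> 'kuoŋ'."""
    return syllable.rstrip("0123456789")


def split_initial(syllable: str) -> tuple:
    """Split a syllable into (initial, rest) using KNOWN_INITIALS."""
    for initial in KNOWN_INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return "", syllable


def apply_cl(host: str, enclitic: str) -> str:
    """Return the enclitic with its initial adjusted to the host's final."""
    segments = strip_tone(host)
    if segments.endswith("ŋ"):
        table = AFTER_NASAL
    elif segments.endswith("ʔ"):
        return enclitic  # context not attested in (40-45); left unchanged
    else:
        table = AFTER_VOWEL
    initial, rest = split_initial(enclitic)
    # Initials without an attested alternation (e.g. m-) stay as they are.
    return table.get(initial, initial) + rest


print(apply_cl("ŋuai31", "ki0"))   # host ends in a vowel, as in (40a)
print(apply_cl("ʔøyŋ51", "ki0"))   # host ends in ŋ, as in (40b)
```

On this sketch the same enclitic 其 surfaces with different initials in (40a) and (40b) purely as a function of the host, while the m-initial interrogative particles pass through unchanged, in line with note 14.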

3.4.3 Summary
The application/​blocking of TS and CL in lexical items, phrases, and the
clitic group domain composed of “host+enclitic” in Fuzhou is summarized
in Table 3.1. ‘√’ denotes the application of the rule, while ‘×’ indicates that the
rule is blocked even though there is an appropriate environment. ‘√/​×’ signifies
that the application is not obligatory.

Table 3.1 Phonological rules and different constructions in the Fuzhou dialect

Rule   Lexical items   Phrases   CG (host+enclitic)
TS     √/×             √/×       ×
CL     √/×             √/×       √

The clitic group composed of “host+enclitic” thus differs from lexical items
and phrases in its phonological properties in Fuzhou. In lexical items and
phrases, the application and blocking of TS and CL are quite complex: the
rules apply in some strings of sounds and are blocked in others. By contrast,
their behavior in the clitic group consisting of “host+enclitic” is clear-cut:
TS is obligatorily blocked, while CL mandatorily applies between the host and
the enclitic. This demonstrates that there are phonological phenomena
characteristic only of the clitic group composed of “host+enclitic” in Fuzhou,
and hence that the clitic group should be established as an indispensable
prosodic constituent in this dialect.
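
The contrast summarized in Table 3.1 can also be written down as a small lookup, which is offered here only as a sketch. The names `Status`, `TABLE_3_1`, `is_categorical`, and `compatible_constructions` are hypothetical labels introduced for illustration, and the observations passed to the second function are assumed to come from a string in which the environment for both rules is met.

```python
from enum import Enum
from typing import List


class Status(Enum):
    APPLIES = "√"      # the rule applies obligatorily
    BLOCKED = "×"      # the rule is blocked despite a suitable environment
    VARIABLE = "√/×"   # application is not obligatory


# Table 3.1, keyed by construction and then by rule.
TABLE_3_1 = {
    "lexical item": {"TS": Status.VARIABLE, "CL": Status.VARIABLE},
    "phrase": {"TS": Status.VARIABLE, "CL": Status.VARIABLE},
    "CG (host+enclitic)": {"TS": Status.BLOCKED, "CL": Status.APPLIES},
}


def is_categorical(construction: str) -> bool:
    """True if neither TS nor CL is variable in the given construction."""
    return all(s is not Status.VARIABLE
               for s in TABLE_3_1[construction].values())


def compatible_constructions(ts_applied: bool, cl_applied: bool) -> List[str]:
    """List the constructions whose table row allows the observed pattern."""
    observed = {"TS": ts_applied, "CL": cl_applied}
    result = []
    for construction, rules in TABLE_3_1.items():
        if all(status is Status.VARIABLE
               or (status is Status.APPLIES) == observed[rule]
               for rule, status in rules.items()):
            result.append(construction)
    return result


print([c for c in TABLE_3_1 if is_categorical(c)])
print(compatible_constructions(ts_applied=True, cl_applied=False))
```

Only the clitic group row is categorical for both rules. A string in which TS applies or CL fails to apply is therefore incompatible with a “host+enclitic” parse, whereas the pattern of blocked TS plus applied CL does not by itself exclude a lexical item or phrase.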

3.5 Violation of SLH caused by the clitic group consisting of “host+enclitic” in the Fuzhou dialect
As mentioned earlier in this chapter, the SLH in prosodic phonology has been
challenged by evidence from various languages. Many examples of the viola-
tion of Exhaustivity, Non-​recursivity, and even Layeredness have been found
across languages. In the Fuzhou dialect, we have also seen examples that
violate Non-​recursivity, as in (23), (30), (38), and (45). Examples in (45) are
re-​presented below.

(46) a. 骹 [[[断]ω 了C]CG 其C]CG 侬

kha44 [[[touŋ31] lau31] ki0] nøyŋ51
→ kha44 [[[touŋ31] nau31] ʔi0] nøyŋ51
leg break CRS MOD people
‘people whose legs were broken’
b. [[[收]ω 遘C]CG 其C]CG 批
[[[siu44] kau213]   ki0]     phiɛ44
→ [[[siu44] ʔau213]      ʔi0]     phiɛ44
receive PVP    MOD   letter
‘the letter that was received’

Example (46) shows that a clitic group composed of “host+enclitic” in Fuzhou
may dominate another clitic group of the same type. Take (46a) as an example.
In the internal clitic group, the enclitic 了 [lau31] attaches to the prosodic
word 断 ‘break’ as the host, and in the external clitic group, the enclitic
其 [ki0] attaches to the internal clitic group [断了]CG as the host.
Examples like those in (46) are clearly cases of the violation of Non-​
recursivity, which constitutes a great challenge to the SLH. Nonetheless,
this can be accounted for by assuming a weakened SLH that allows pros-
odic recursivity in a given language. Thus, it would not be a problem if a
clitic group has another clitic group as the host. The domain formation of
the clitic group composed of “host+enclitic” in Fuzhou thus can be given
in (47).

(47) Clitic group (CG) (host+enclitic) formation in the Fuzhou dialect

The domain of the CG (host+enclitic) consists of one independent (i.e.,
non-​clitic) prosodic constituent (ω or CG), plus any adjacent
a. directional enclitic, or
b. plain enclitic/​non-​directional enclitic such that there is no possible
host with which they share more category memberships.

In this way, the problem caused by the attachment of enclitics to constituents
higher than the prosodic word (in this case, the clitic group) in the Fuzhou
dialect can be nicely captured. This problem is not due to the clitic group
itself, but only to the restrictions imposed by the SLH, as suggested by
Vogel (2009). It can be resolved by resorting to a weakened SLH with no
undesirable theoretical consequences, which further substantiates the idea
that a weakened SLH is required in the theory of prosodic phonology.
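
The recursive domain defined in (47) can be sketched as a data structure in which the host of a clitic group is either a prosodic word or another clitic group. This is an illustration under stated assumptions rather than a formalization from the literature: the class names, `depth`, and `violates_non_recursivity` are hypothetical, and the distinction between directional and plain enclitics in (47a–b) is not modeled.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class ProsodicWord:
    form: str

    def bracket(self) -> str:
        return f"[{self.form}]ω"


@dataclass(frozen=True)
class CliticGroup:
    # Per (47), the host is an independent constituent: ω or CG.
    host: Union[ProsodicWord, "CliticGroup"]
    enclitic: str

    def bracket(self) -> str:
        return f"[{self.host.bracket()} {self.enclitic}C]CG"


def depth(node) -> int:
    """Number of CG layers above the innermost prosodic word."""
    return 0 if isinstance(node, ProsodicWord) else 1 + depth(node.host)


def violates_non_recursivity(node) -> bool:
    """True if a CG directly dominates another CG as its host."""
    return isinstance(node, CliticGroup) and isinstance(node.host, CliticGroup)


inner = CliticGroup(ProsodicWord("断"), "了")   # [断了]CG in (46a)
outer = CliticGroup(inner, "其")                # 其 takes [断了]CG as host
print(outer.bracket())
print(depth(outer), violates_non_recursivity(outer))
```

Allowing the host field to be a clitic group is the structural counterpart of weakening the SLH: the bracketing of (46a) is generated directly, and the violation of Non-recursivity is a property of the representation, not an exception to it.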

3.6 Conclusion
Based on the discussions in previous studies on clitics and the clitic group
across languages, this study presents a thorough investigation of enclitics and
the clitic group consisting of “host+enclitic” in Fuzhou, from the perspectives
of morphosyntactic functions and phonological behavior. The following
properties of enclitics and the clitic group consisting of “host+enclitic” in
Fuzhou have been identified:

(48) Properties of enclitics in the Fuzhou dialect

a. Fuzhou enclitics all belong to functional categories;
b. Fuzhou enclitics never occur as the only element of an utterance and
must attach to the adjacent prosodic unit (ω or CG) as the host;
c. The meaning of the string of the host plus the enclitic is predictable
from the meaning of the host and that of the enclitic;
d. Fuzhou enclitics can attach to material already containing the affix or
the clitic;
e. Fuzhou enclitics are phonologically dependent –​the initial of an
enclitic is always decided by the final of the preceding syllable.

(49) Properties of the clitic group consisting of “host+enclitic” in the Fuzhou dialect

a. TS is obligatorily blocked between the host and the enclitic.
b. CL obligatorily applies between the host and the enclitic.

Thus, on the one hand, the enclitic-like elements in Fuzhou reported in the
literature are indeed enclitics, since they share common properties with
enclitics in other languages. On the other hand, the group of “host+enclitic”
in this dialect does exhibit distinctive phonological behavior compared with
lexical items and phrases.
By establishing a prosodic constituent that contains the host plus the enclitic
in the Fuzhou dialect, I distinguish the “host+enclitic” group from lexical items
and phrases. I have thus accounted for the phonological behavior exhibited
by the group of “host+enclitic”, part of which has been noticed in previous
studies with no further explanation. The distinctive phonological behavior of
the “host+enclitic” group in Fuzhou, in turn, provides evidence and motiv-
ation for the existence of the clitic group within the prosodic hierarchy.
Moreover, a Fuzhou clitic group composed of “host+enclitic” can dom-
inate another clitic group. This indicates that the violation of Non-​recursivity
is allowed in this dialect, which can be accounted for by assuming a weakened
SLH, instead of excluding the clitic group from the prosodic hierarchy.
Therefore, the cases of Fuzhou enclitics and the clitic group consisting of
“host+enclitic” provide evidence for not only the existence of the clitic group
but also the necessity of a weakened SLH.

Notes

1 I would like to thank Prof. Hongming Zhang, who has persistently encouraged
and pushed me during my writing of this chapter. His valuable comments and
suggestions have greatly improved the quality of this chapter. An earlier version of
this chapter was presented at the 24th Columbia University Graduate Conference
on East Asia in New York, NY, February 2015. Thanks go to the audience at the
conference for various questions and comments. I would also like to thank my
informants, Mr. Dexing Chen, Mrs. Ling Chen, and Mrs. Liping Song, for their
patience and support during my fieldwork in Fuzhou in 2016 and their suggestions
and comments despite the physical distance ever since 2015. Of course, any
remaining errors in this chapter are mine.
2 Sound change to finals in Fuzhou is a tonally conditioned phonological process:
it occurs where tone sandhi occurs and is blocked whenever tone sandhi
is blocked (cf. Chen and Norman 1965; Chan 1985; Chen 1998, among others).
I assume that the domain of application for sound changes to finals should be the
same as the domain of application for tone sandhi. Hence, for the sake of brevity,
sound changes to finals will not be presented and discussed in this chapter.
3 The tone sandhi behavior of lexical items formed through reduplication like
(3) is conditioned by another Fuzhou rule, which is referred to as Morphological
Tone Sandhi in You (2017). Please see Chen and Norman (1965), Chen (1998),
You (2017), among others, for more details.
4 The complex tone sandhi behavior of phrasal-​level constructions exhibited by the con-
trast between (2) and (4) has long been a problem for linguists. Readers are referred
to Chen and Norman (1965), Chan (1980, 1985), Wright (1983), Shih (1986), Hung
(1987), Zhang (1992), Chan (1998), You (2017), among others, for different analyses.
5 For detailed discussion on the blocking of CL in lexical items like (7), please see
You (2017).
6 The application/​blocking of CL in phrasal-​level constructions is another long-​
standing problem for linguists. Please see Chen and Norman (1965), Chan (1985),
Shih (1986), You (2017), among others, for different analyses.

7 Fuzhou has both proclitics and enclitics, and the Fuzhou clitic group can be divided
into two types according to the internal prosodic structure. This chapter does not
examine Fuzhou proclitics and the clitic group composed of “proclitic+host”, as
this topic will be discussed in a future study.
8 These two terms are interchangeable in the theory of prosodic phonology.
9 In N&V’s (1986) terminology, a directional clitic refers to a clitic phonologically
dependent on an element to the left or right according to its own inherent prop-
erty. A plain/​non-​directional clitic, in contrast, refers to a clitic that finds its host
either to the right or to the left.
10 For detailed discussions on the prosodic word in Fuzhou, please see You (2017).
11 Prof. Hongming Zhang insightfully points out that the violation of the SLH
I observed in the cases of the clitic group consisting of “host+enclitic” in the
Fuzhou dialect can be nicely handled by assuming that Non-​recursivity is violable
in this dialect. I thank Prof. Zhang for pointing this out to me.
12 Elements like 依- (12b) and -囝 (29d) are affixes in the Fuzhou dialect; a
Fuzhou enclitic can thus attach to a string of sounds that already contains
an affix.
13 Note that the complement here is different from the term “complement” used
under the X-​bar framework of syntax. In syntactic theory, the term “complement”
often refers to the sister node of the head, and hence in the case of verb phrases,
the complement is actually the object of the head verb. In contrast, the “comple-
ment” here usually indicates the manner, the result, or the duration of the action
denoted by the verb.
14 Examples of “host+interrogative particle” are not presented in this subsection
since all the three interrogative particles have the initial m-​, which always remains
unchanged in a CL context.

References

Booij, G. (1983) “Principles and parameters in prosodic phonology”, Linguistics,
21(1), pp. 249–​280.
Booij, G. (1985) “The interaction of phonology and morphology in prosodic
phonology” in Gussmann, E. (ed.) Phono-morphology: Studies in the interaction of
phonology and morphology. Lublin: Katolicki Uniwersytet Lubelski, pp. 23–34.
Booij, G. (1996) “Cliticization as prosodic integration: The case of Dutch”, Linguistic
Review, 13(3–​4), pp. 219–​242.
Chan, L.-​L. L. (1998) Fuzhou tone sandhi. Ph.D. Diss., University of California
San Diego.
Chan, M. K.-​M. (1980) Syntax and phonology interface: The case of tone sandhi in
the Fuzhou dialect of Chinese. MS Thesis, University of Washington.
Chan, M. K.-​M. (1985) Fuzhou phonology: A non-​linear analysis of tone and stress.
Ph.D. Diss., University of Washington.
Chen, L., and Norman, J. (1965) An introduction to the Foochow dialect. San
Francisco: San Francisco State College.
Chen, M. Y. (1985) The syntax of Xiamen tone sandhi. MS, University of California
San Diego.
Chen, M. Y. (1987) “The syntax of Xiamen tone sandhi”, Phonology Yearbook, 4, pp.
Chen, Z.-P. (1998) Fuzhou Fangyan Yanjiu [A study of the Fuzhou dialect].
Fuzhou: Fujian People’s Publishing.

Crystal, D. (2008) A dictionary of linguistics and phonetics. 6th edition. Malden, MA;
Oxford: Blackwell.
Feng, A.-​Z. (1998) Fuzhou Fangyan Cidian [The dictionary of the Fuzhou dialect].
Nanjing: Jiangsu Education Publishing.
Haspelmath, M., and Sims, A. D. (2010) Understanding morphology. London: Hodder
Hayes, B. (1984/​1989) “The prosodic hierarchy in meter” in Kiparsky, P., and Youmans,
G. (eds.) Rhythm and meter. Orlando, FL: Academic Press, pp. 201–​260.
Hung, T. T.-​N. (1987) Syntactic and semantic aspects of Chinese tone sandhi. Ph.D.
Diss., University of California San Diego.
Hyman, L. M., Katamba, F., and Walusimbi, L. (1987) “Luganda and the strict layer
hypothesis”, Phonology Yearbook, 4, pp. 87–​108.
Inkelas, S. (1989) Prosodic Constituency in the Lexicon. Ph.D. Diss., Stanford
University, Stanford. (Published 1990, Outstanding Dissertations in Linguistics
Series. New York: Garland Publishing.)
Inkelas, S., and Zec, D. (1995) “Syntax-​phonology interface” in Goldsmith, J. (ed.) The
handbook of phonological theory. Malden, MA; Oxford: Blackwell, pp. 535–549.
Itô, J., and Mester, A. (1992/​2003) “Weak layering and word binarity” in Honma,
T., et al. (eds.) A new century of phonology and phonological theory: A fest-
schrift for Professor Shosuke Haraguchi on the occasion of his sixtieth birthday.
Tokyo: Kaitakusha, pp. 26–​65.
Kabak, B., and Vogel, I. (2001) “The phonological word and stress assignment in
Turkish”, Phonology, 18(3), pp. 315–​360.
Klavans, J. L. (1982) Some problems in a theory of clitics. Bloomington: Indiana
University Linguistics Club.
Ladd, D. R. (1986) “Intonational phrasing: The case for recursive prosodic structure”,
Phonology Yearbook, 3, pp. 311–​340.
Li, C. N., and Thompson, S. A. (1981) Mandarin Chinese: A functional reference
grammar. Berkeley: University of California Press.
Li, R.-​L., and Liang, Y.-​Z. (2001) Fuzhou Fangyan Zhi [A record of the Fuzhou dia-
lect]. Fuzhou: Haifeng chubanshe.
Li, R.-​L., Liang, Y.-​Z., Zou, G.-​C., and Chen, Z.-​P. (1994) Fuzhou Fangyan Cidian
[The dictionary of the Fuzhou dialect]. Fuzhou: Fujian renmin chubanshe.
Li, Z.-​Q. (2002) Fuzhou phonology and grammar. Hyattsville: Dunwoody Press.
Nespor, M., and Vogel, I. (1986) Prosodic phonology. Dordrecht: Foris.
Odden, D. (1987) “Kimatuumbi phrasal phonology”, Phonology Yearbook, 4,
pp. 13–​26.
Peperkamp, S. (1997) Prosodic words. Ph.D. Diss., University of Amsterdam.
Prince, A., and Smolensky, P. (1993) Optimality theory: Constraint interaction in gen-
erative grammar. Cambridge, MA: MIT Press.
Selkirk, E. (1978/​1981) “On prosodic structure and its relation to syntactic structure”,
in Fretheim, T. (ed.) Nordic prosody II. Trondheim: Tapir, pp. 111–​140.
Selkirk, E. (1984) Phonology and syntax: The relation between sound and structure.
Cambridge, MA: MIT Press.
Selkirk, E. (1986) “On derived domains in sentence phonology”, Phonology Yearbook,
3, pp. 371–​405.

Selkirk, E. (1996) “The prosodic structure of function words” in Morgan, J. L., and
Demuth, K. (eds.) Signal to syntax: Bootstrapping from speech to grammar in early
acquisition. Mahwah, NJ: Lawrence Erlbaum Associates, pp. 187–​214.
Shih, C.-​L. (1986) The prosodic domain of tone sandhi in Chinese. Ph.D. Diss.,
University of California San Diego.
Sun, C.-F. (2006) Chinese: A linguistic introduction. Cambridge; New York,
NY: Cambridge University Press.
Truckenbrodt, H. (1995) Phonological phrases: Their relation to syntax, focus and
prominence. Ph.D. Diss., Massachusetts Institute of Technology.
Truckenbrodt, H. (1999) “On the relation between syntactic phrases and phonological
phrases”, Linguistic Inquiry, 30(2), pp. 219–​255.
Vogel, I. (2009) “The status of the clitic group” in Grijzenhout, J., and Kabak, B.
(eds.) Phonological domains: Universals and deviations. Berlin: Mouton de Gruyter,
pp. 15–​46.
Wright, M. S. (1983) A metrical approach to tone sandhi in Chinese dialects. Ph.D.
Diss., University of Massachusetts, Amherst.
You, S.-​X. (2017) Prosodic phonology of the Fuzhou dialect. Ph.D. Diss., University
of Wisconsin-​Madison.
Zec, D. (1988) Sonority constraints on prosodic structure. Ph.D. Diss., Stanford
Zhang, H.-​M. (1992) Topics in Chinese phrasal tonology. Ph.D. Diss., University of
California San Diego.
Zhang, H.-​M. (2014) “Yunlu yinxixue yu Hanyu yunlu yanjiu zhong de ruogan wenti”
[Some issues on prosodic phonology and Chinese prosodic studies], Dangdai
Yuyanxue [Contemporary Linguistics], 16(3), pp. 303–​327.
Zhang, H.-​M. (2017) Syntax-​phonology interface: Argumentation from tone sandhi in
Chinese dialects. London; New York: Routledge.
Zheng, Y.-​D. (1988) “Fuzhou fangyan ‘li’ de cixing jiqi yongfa” [Part of speech and
usages of “lɛ” in the Fuzhou dialect], Zhongguo Yuwen [Studies of the Chinese lan-
guage], 6, pp. 450–​452.
Zwicky, A. M. (1977) On clitics. Bloomington: Indiana University Linguistics Club.
Zwicky, A. M., and Pullum, G. (1983) “Cliticization vs. inflection: English n’t”,
Language, 59(3), pp. 502–​513.

Part II

Prosodic patterns

Geographical clines in the realization
of intonation in the Netherlands
Judith Hanssen, Carlos Gussenhoven,
and Jörg Peters

4.1 Introduction
Geography is one of the explanatory factors of phonetic variation in speech
(cf. Britain 2013). The realization of intonation contours has recently been
shown to follow a geographical cline from the southwest to the northeast
of the Netherlands, with a continuation to the low Saxon dialect of Weener
across the border in Germany (Peters et al. 2014, 2015). Earlier, Gilles
(2005: 165) suggested that pitch excursions of f0 falls in varieties of German
are larger in the west than in the east of Germany, on the basis of limited
data. In these cases, the variation concerns realizational differences in Ladd’s
(2008: 116) terms, that is, differences in the phonetic realization of compar-
able phonological forms.
The realization of intonation contours may differ in more general ways
than as a function of contextual factors, like the segmental composition of the
accented syllable, upcoming word boundaries, or focus. First, peak timing
differs: it is earlier in English than in Dutch and German, and later in
southern German than in northern German (Atterer and Ladd 2004; Ladd
et al. 2009; Mücke et al. 2009), while Kügler (2007) reported later f0 peaks in
the southern Swabian variety of German than in the eastern Upper Saxon
variety. Dialectal variation in tonal timing has also been reported for var-
ieties of Lowland Scots1 (van Leyden 2004), German (Peters 1999; Gilles
2005), American English (Arvaniti and Garding 2007), Irish (Kalaldeh et al.
2009), and British English (Ladd et al. 2009). Second, pitch excursion size and
overall pitch level equally show regional variation. Belgian women speak at a
higher pitch than Dutch women (van Bezooijen 1993), and Gilles (2005: 165)
reported variation in f0 excursion size of falling contours between speakers of
eight varieties of German. Ulbrich (2005) reported differences in pitch range
between speakers of two standard varieties of German (Swiss and Northern
German).2 Finally, the dialects spoken on the Orkney and Shetland islands
differ in overall pitch level, with intonation contours in the Orkney variety
being realized at a higher pitch (van Leyden 2004).
For Dutch, dialectal characteristics had until recently only been described
informally (van Es 1935; Daan 1938; Weijnen 1966). Two studies have
now added support to geographical clines on the basis of comparative
measurements on Zeelandic and Hollandic Dutch, West Frisian, Dutch
and German Low Saxon, and Northern High German. Peters et al. (2014)
investigated the effect of focus domain sizes smaller than the word on the real-
ization of non-​final falling nuclear contours and Peters et al. (2015) examined
the effects of word boundary location on tonal timing of non-​final nuclear
falls. Besides reporting effects, the authors described more general dialectal
differences in contour realization. For a number of phonetic variables, an
inverted U-​shaped cline was observed. Overall, the “central” varieties took
more time to realize the f0 movements, which resulted in larger f0 excursions,
higher peaks, and later alignment of the pitch gesture with the segmental
string. The accentual gestures of the peripheral varieties, on the other hand,
were more compact, both in terms of duration and excursion. Apparently,
the phonetic realization of nuclear falls in these varieties is determined more
strongly by geographical proximity than by their linguistic grouping.
Compared to non-​final falls, final falls may be realized with longer seg-
mental durations due to final lengthening (e.g., Wightman et al. 1992;
Gussenhoven and Rietveld 1992), earlier f0 peaks (e.g., Steele 1986; Prieto
et al. 1995; Peters 1999), and steeper or shorter f0 falls (e.g., Grabe 1998). This
chapter intends to expand on the data by Peters et al. (2014, 2015) by investi-
gating whether we can replicate the finding of a geographical cline in the real-
ization of non-​final nuclear falling contours and whether it is also found for
IP-​final nuclear falls and fall-​rises. Since the peripheral varieties of German
Low Saxon and Northern High German are not included in our data set, we
actually expect to find only part of the inverted U-​shape, that is, a truncated
one. We will report dialectal differences in segmental duration as well as tonal
timing, pitch excursion, pitch slope, and overall pitch level.

4.2 Procedure

4.2.1 Materials
We used three sets of sentences. The first set contained four declarative narrow-​
focus carrier sentences with a non-​final falling pitch accent (nf-​FALL); the
second set contained four declarative narrow-​focus carrier sentences with an
IP-​final falling pitch accent (f-​FALL); and the last set contained four rhet-
orical questions with an IP-​final falling-​rising pitch accent (f-​FR). All 12 car-
rier sentences (labeled “B”) were preceded by a context sentence (“A”) with
which they formed a mini-​dialogue, as illustrated in Table 4.1. In the non-​final
declaratives, the target words consisted of fictitious place names, Momberen,
Memberen, Manderen, Munderen,3 which had the metrical pattern sww, in
which the segmental structure of the accentable first syllable was Nasal-​V-​
Nasal, followed by a voiced plosive onset consonant. They were followed by
a sequence of two sw verbs. In the carrier sentences for the accentable IP-final
position, four fictitious monosyllabic proper names, Lof, Loof, Lom,
Loom, were used as target words in each pragmatic condition. These varied
in the rime only, where short [ɔ] and long [oː] combined with voiceless [f] and
sonorant [m].

Table 4.1 Dutch context sentences and experimental sentences used to elicit non-final
falls, final falls, and final fall-rises, with English translations

| Condition | Context sentence | Carrier sentence |
|---|---|---|
| nf-FALL | Waar zouden je oom en tante willen wonen? | Ze zouden bij MANDEREN willen wonen. |
| | Where would your uncle and aunt want to live? | They’d like to live near Manderen. |
| f-FALL | Met wie gaat je baas morgen trouwen? | Hij trouwt met mevrouw de LOOM. |
| | Who will your boss marry tomorrow? | He’ll marry Mrs. de Loom. |
| f-FR | Dit antieke horloge is nog van opa Thijssen geweest. | Het was toch van opa LOOM? |
| | This antique wristwatch used to belong to grandfather Thijssen. | But didn’t it belong to grandfather Loom? |

Note: The target sentences are printed in bold; the word carrying the nuclear pitch accent is

A slightly modified version of these sentences was used to collect the
Standard Dutch data. The sentences shown in Table 4.1 were used for
Zuid-​Beveland, Rotterdam, and Amsterdam. Speakers from Zuid-​Beveland
translated the sentences into their variety as they spoke. We translated the
sentences into the local language for speakers of West Frisian and Low
Saxon, which have standardized spelling systems. For all varieties, the
rhythmic, lexical, and segmental contexts were comparable to the Standard
Dutch materials. A list of the sentences in all language versions is given in the

4.2.2 Varieties and subjects

Recordings were made in five locations along the Dutch coast, covering four
dialect groups (Figure 4.1). Zeelandic Dutch in Zuid-​Beveland (ZB), Southern
Hollandic in Rotterdam (RO), and Northern Hollandic in Amsterdam (AM)
belong to the Low Franconian dialect group. West Frisian was recorded in
Grou (GR) and Low Saxon in Winschoten (WI). The Standard Dutch (SD)
speakers were recorded in Nijmegen. Historically, Standard Dutch has close
relations to western varieties like Rotterdam and Amsterdam (cf. Smakman
2006 and references therein).

Figure 4.1 Recording locations in the Netherlands

We recorded 119 speakers (between 18 and 23 speakers for each variety),
49 of whom were male. They were aged between 14 and 49. Participants
were university students (SD), secondary school students (ZB), members of
a Scouting club (RO, AM), or members of the local community (GR, WI).
The speakers from Zuid-​Beveland, Grou, and Winschoten were bilingual with
Standard Dutch and their local language. All regional speakers and at least
one of their parents were raised in the selected place and spoke the indigenous
variety fluently. For Standard Dutch, the procedure was different, as the area
where this variety is spoken is less determined by geographical boundaries.
Speakers could participate if they reported speaking Standard Dutch. Besides
self-​reporting, two Dutch phoneticians independently judged each recording.
Recordings were included if the judges agreed that the geographical and lin-
guistic origin of the participants could not be determined by their accent.
Except for the speakers of West Frisian and Standard Dutch, our speakers
were less familiar with their local language as a written language, which may
have had a negative influence on the fluency of the speech in the reading task
of some speakers.
Participants’ recordings were excluded if they were (highly) disfluent or
appeared to the experimenter not to speak naturally; if the speakers afterward
reported that they were dyslexic or had hearing problems; or if the speakers
turned out not to satisfy the requirements with respect to their linguistic or
geographical background. All participants were naive as to the purpose of the
task and were paid for their participation.

4.2.3 Recording procedure and data selection

To avoid listing effects, the 12 mini-dialogues were interspersed with 61
filler sentences (used for other experiments) and presented in a booklet, in
randomized order, which was reversed for half of the subjects per variety.
Speakers were recorded in pairs to reduce any effects of the experimenter’s
presence and the nature of the task on their dialect level. One speaker read the
context sentence and the other the carrier sentence. The participants switched
roles at the end of the task after they had repeated any mispronounced
The Standard Dutch recordings were made in a professional studio at
Radboud University Nijmegen; recordings of the local varieties were made
in a quiet room either in the homes of our speakers or in a public building.
We used a portable digital recorder (Tascam HD P2 for Standard Dutch and
Zoom H4 for all other varieties) with a 48 kHz sampling rate, 16 bit resolution,
and stereo format. The participants wore head-​mounted Shure WH30XLR or
Sennheiser MKE 2 wired condenser microphones.
All recorded target sentences were converted to monaural files and
stored on computer disk as separate wav files with a sampling rate of 48
kHz and 16-​bit resolution. Utterances were excluded from further analysis
if they showed deviant pitch patterns due to accent position or choice of
nuclear pitch contour. More specifically, for the declarative condition, we
only included utterances that were realized with a nuclear falling contour
(H*L L%), and only utterances with a fall-​rise (H*L H%) were selected for
the rhetorical questions. Zuid-Beveland speakers often realized the fall-rise
as a “rise-rise”, that is, a sequence of rising movements without a low
turning point between the two peaks. Therefore, we only included the eight
ZB participants whose data could be labeled as H*L H%, that is, as fall-​rises
with a low turning point. A final remark with respect to data selection is
that speakers of Winschoten pronounced the trisyllabic target word in non-​
final falls “Manderen” as disyllabic [mɑndəːn] in over 70 percent of the cases,
whereas in other varieties it was realized with three syllables, [mɑndərə]. We
nevertheless included Winschoten in our analyses, and will interpret the
results in this context.
The total number of speakers whose data was used for analysis is given in
Table 4.2, broken down by variety, sentence condition, and gender.

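Table 4.2 Number of speakers used in the analyses, broken down by variety, sentence
condition, and gender

| Variety | nf-FALLS F | nf-FALLS M | nf-FALLS total | f-FALLS F | f-FALLS M | f-FALLS total | f-FR F | f-FR M | f-FR total |
|---|---|---|---|---|---|---|---|---|---|
| SD | 13 | 8 | 21 | 9 | 8 | 17 | 13 | 9 | 22 |
| ZB | 7 | 10 | 17 | 7 | 8 | 15 | 6 | 2 | 8 |
| RO | 7 | 12 | 19 | 3 | 10 | 13 | 7 | 8 | 15 |
| AM | 7 | 11 | 18 | 4 | 2 | 6 | 6 | 6 | 12 |
| GR | 20 | 3 | 23 | 18 | 3 | 21 | 20 | 2 | 22 |
| WI | 13 | 4 | 17 | 12 | 4 | 16 | 13 | 2 | 15 |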

4.2.4 Variables and analysis

Acoustic and auditory analysis of the data was done with the help of the
speech processing software package Praat (Boersma and Weenink 2008). We
inserted the labels listed in Table 4.3 and stored their time (t) and f0 (f), which
was converted from Hz to semitones (ST re 100 Hz).
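
The conversion to semitones relative to 100 Hz follows the standard logarithmic relation ST = 12 × log2(f0 / 100). The following is a minimal sketch of that conversion in Python; the function name `hz_to_st` and the sample values are illustrative assumptions and are not part of the Praat workflow described here.

```python
import math

REFERENCE_HZ = 100.0  # reference frequency for "ST re 100 Hz"

def hz_to_st(f0_hz, ref_hz=REFERENCE_HZ):
    """Convert an f0 value in Hz to semitones relative to ref_hz.

    hz_to_st is a hypothetical helper name used for illustration only.
    """
    if f0_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f0_hz / ref_hz)

# One octave above the reference corresponds to 12 ST,
# and the reference frequency itself corresponds to 0 ST.
print(hz_to_st(200.0))
print(hz_to_st(100.0))
```

Because the scale is logarithmic, equal frequency ratios map onto equal semitone distances, whatever the absolute pitch level.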
Segmental labels were all placed manually at segment boundaries. The
boundaries were determined according to general practice, on the basis of
visual inspection of waveform and broadband spectrogram, aided by audi-
tory information (Turk et al. 2006). We placed all labels at negative-​to-​positive
zero-​crossings. Tonal labels were either low (L) or high (H). L1, the elbow
before the peak, and H were determined semi-​automatically using a Praat
function that traces the location of the highest or lowest f0 value in a selected
interval. Determining the location of the elbow after the nuclear peak (L2)
was less straightforward, especially in those cases where contours displayed a
gradual change in slope (cf. Del Giudice et al. 2007; Petrone and D’Imperio
2009). To increase interrater agreement, we therefore determined L2 visually
by selecting the location of the highest change in the speed of the f0 movement
near the bottom line of the nuclear contour.4 If two elbows were visible in the
low-​pitched section after the peak, we selected the first one. Each label was
checked and corrected for tracking errors due to pitch perturbations.
Using the labels in Table 4.3, we computed the dependent variables listed
in Table 4.4.
Unless otherwise stated, we analyzed the data using the Linear Mixed
Effects Model in SPSS, including Speaker and Sentence as random factors,
and Dialect (SD, ZB, RO, AM, GR, WI) and Gender as fixed factors.
SentenceCondition (nf-​FALL, f-​FALL, f-​FR) was included as a fixed
factor in the model for those dependent variables that were measured for all
contours. Pairwise comparisons between the levels of the fixed factor were
carried out using the Bonferroni correction.
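
The idea behind the Bonferroni correction for pairwise comparisons can be sketched as follows. With six levels of Dialect there are 15 possible pairs, and each raw p-value is multiplied by the number of comparisons (capped at 1). This is only an illustration of the general principle, not a reproduction of the SPSS procedure; the function name and the raw p-values are hypothetical.

```python
from itertools import combinations

DIALECTS = ["SD", "ZB", "RO", "AM", "GR", "WI"]

def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of comparisons, capping at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}

# All unordered pairs of the six dialect levels.
pairs = list(combinations(DIALECTS, 2))
print(len(pairs))

# Hypothetical raw p-values, for illustration only (not results from this chapter).
raw = {pair: 0.5 for pair in pairs}
raw[("SD", "WI")] = 0.0001
raw[("ZB", "WI")] = 0.002

adjusted = bonferroni_adjust(raw)
print(adjusted[("SD", "WI")], adjusted[("ZB", "WI")])
```

A pair is then reported as significant only if its adjusted p-value remains below the chosen threshold.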

Table 4.3 Overview of acoustic measurement labels

| Label | Description | nf-FALL | f-FALL | f-FR |
|---|---|---|---|---|
| Pitch targets | | | | |
| H | maximum f0 of nuclear pitch accent (nuclear peak) | ✓ | ✓ | ✓ |
| L2 | elbow after nuclear peak | ✓ | ✓ | ✓ |
| H2 | maximum f0 of final boundary tone | | | ✓ |
| Segmental boundaries | | | | |
| O1 | beginning of onset of nuclear syllable | ✓ | ✓ | ✓ |
| N1 | beginning of rime of nuclear syllable | ✓ | ✓ | ✓ |
| C1 | beginning of coda of nuclear syllable | ✓ | ✓ | ✓ |
| O2 | end of rime of nuclear syllable | ✓ | ✓ | ✓ |

Table 4.4 Acoustic variables used in the comparison of non-final and final nuclear
contours in five varieties

| Variable | Description | Formula | nf-FALL | f-FALL | f-FR |
|---|---|---|---|---|---|
| Durational variables | | | | | |
| RimeDuration | the duration of the sonorant rime of the nuclear syllable in ms | t(O2) – t(N1) | ✓ | ✓ | ✓ |
| Timing variables | | | | | |
| H-RelTiming | the timing of H as a proportion of the sonorant rime duration in % | (t(H) – t(N1)) / (t(O2) – t(N1)) * 100 | ✓ | ✓ | ✓ |
| Scaling variables | | | | | |
| H-Scaling | the height of the nuclear peak in ST re 100 Hz | f(H) | ✓ | ✓ | ✓ |
| L-Scaling | the height of the elbow following the nuclear peak in ST re 100 Hz | f(L) | ✓ | ✓ | ✓ |
| H2-Scaling | the height of the final boundary tone in fall-rises in ST re 100 Hz | f(H2) | | | ✓ |
| Contour shape variables | | | | | |
| FallDuration | the duration of the fall following the nuclear peak in ms | t(L) – t(H) | ✓ | ✓ | ✓ |
| FallExcursion | the excursion of the fall following the nuclear peak in ST | f(L) – f(H) | ✓ | ✓ | ✓ |
| FallSlope | the rate of change of the fall following the nuclear peak in ST/s | FallExcursion / FallDuration * 1000 | ✓ | ✓ | ✓ |
| RiseDuration | the duration of the final rise in fall-rises in ms | t(H2) – t(L) | | | ✓ |
| RiseExcursion | the excursion of the final rise in fall-rises in ST | f(H2) – f(L) | | | ✓ |
| RiseSlope | the rate of change of final rise in fall-rises in ST/s | RiseExcursion / RiseDuration * 1000 | | | ✓ |
| RatioFRDur | relation between duration of falling and rising part of fall-rise | FallDuration / RiseDuration | | | ✓ |
| Shape ratios | | | | | |
| RatioFRExc | relation between excursion of falling and rising part of fall-rise | FallExcursion / RiseExcursion | | | ✓ |
| RatioFRSlope | relation between slope of falling and rising part of fall-rise | FallSlope / RiseSlope | | | ✓ |
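
The formulas in Table 4.4 can be made concrete with a short sketch. It assumes that label times are stored in ms and f0 values in ST re 100 Hz; the `Labels` container, the function names, and the sample values are hypothetical and serve only to illustrate the arithmetic.

```python
from dataclasses import dataclass

@dataclass
class Labels:
    """Label values for one utterance (hypothetical container).

    Times are assumed to be in ms, f0 values in ST re 100 Hz.
    """
    t_N1: float  # beginning of rime of nuclear syllable
    t_O2: float  # end of rime of nuclear syllable
    t_H: float   # time of nuclear peak
    t_L: float   # time of elbow after the peak
    f_H: float   # f0 of nuclear peak
    f_L: float   # f0 of elbow after the peak

def rime_duration(lab):
    """RimeDuration = t(O2) - t(N1), in ms."""
    return lab.t_O2 - lab.t_N1

def h_rel_timing(lab):
    """H-RelTiming = (t(H) - t(N1)) / (t(O2) - t(N1)) * 100, in %."""
    return (lab.t_H - lab.t_N1) / rime_duration(lab) * 100.0

def fall_duration(lab):
    """FallDuration = t(L) - t(H), in ms."""
    return lab.t_L - lab.t_H

def fall_excursion(lab):
    """FallExcursion = f(L) - f(H), in ST (negative for a fall, as written)."""
    return lab.f_L - lab.f_H

def fall_slope(lab):
    """FallSlope = FallExcursion / FallDuration * 1000, in ST/s."""
    return fall_excursion(lab) / fall_duration(lab) * 1000.0

# Hypothetical label values for a single utterance.
example = Labels(t_N1=100.0, t_O2=300.0, t_H=226.0, t_L=426.0, f_H=18.0, f_L=10.0)
print(rime_duration(example), h_rel_timing(example))
print(fall_duration(example), fall_excursion(example), fall_slope(example))
```

Taken literally, f(L) – f(H) yields a negative excursion and slope for a falling movement; whether magnitudes or signed values are plotted is not something this sketch settles.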


Figure 4.2 Mean sonorant rime duration in non-final falls, final falls, and final
fall-rises for each variety. Error bars represent ±2 standard errors of
the mean

Since female speakers on average speak at a higher pitch level than male
speakers (225 Hz vs. 125 Hz), we measured f0 in semitones. This will to a large
extent normalize gender variation where excursion sizes are concerned, but
will not normalize differences in the scaling of individual pitch targets (such
as the scaling of the nuclear peak). The effects of Dialect on tonal scaling
(H-Scaling, L-Scaling, H2-Scaling) will therefore be reported for the largest
gender group only, female speakers.
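
Why semitones normalize excursions but not the scaling of individual targets can be illustrated with a small calculation. The sketch below uses the mean levels cited above (225 Hz and 125 Hz) as starting points for two hypothetical falls that span the same frequency ratio; the function name and the fall end points are assumptions made for illustration.

```python
import math

def interval_st(f_from_hz, f_to_hz):
    """Size of the interval from f_from_hz to f_to_hz in semitones."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)

# Two hypothetical falls with the same frequency ratio (0.8),
# starting at the mean female and male levels cited in the text.
female_fall = interval_st(225.0, 180.0)
male_fall = interval_st(125.0, 100.0)

# The difference in overall level between the two groups is not removed.
level_gap = interval_st(125.0, 225.0)

print(female_fall, male_fall)
print(level_gap)
```

The two excursions come out identical in ST, while the level difference between the groups remains roughly 10 ST, which is why the scaling variables are reported for one gender group only.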

Table 4.5 Effect of Dialect, Gender, and Sentence_condition on RimeDuration

| Effect | F | p |
|---|---|---|
| Dialect | F(5,111) = 3.83 | p < .01 |
| Gender | F(1,110) = 22.16 | p < .001 |
| Sentence_condition | F(2,10) = 62.86 | p < .001 |
| Dialect × Sentence_condition | F(10,957) = 3.48 | p < .001 |
| Gender × Sentence_condition | F(2,964) = 6.96 | p < .001 |

Table 4.6 Effect of Dialect and Gender on RimeDuration in non-final falls, final
falls, and final fall-rises

| Condition | Dialect F | Dialect p | Gender F | Gender p |
|---|---|---|---|---|
| nf-FALL | F(5,104) = 6.35 | p < .001 | F(1,104) = 12.20 | p < .001 |
| f-FALL | F(5,76) = 2.61 | p < .05 | F(1,76) = 15.29 | p < .001 |
| f-FR | F(5,83) = 2.37 | p < .05 | F(1,83) = 9.56 | p < .01 |

4.3 Results

4.3.1 Sonorant rime duration

The bar chart in Figure 4.2, which gives sonorant rime durations by contour
type for varieties separately, allows us to make two observations. First, son-
orant rime duration increases from nf-​FALLS, to f-​FALLS and f-​FR. This
pattern holds across all dialects. Second, rime durations tend to gradually
increase from the southwest (ZB) to the northeast (WI).
As Table 4.5 illustrates, we found main effects of Dialect, Gender, and
Sentence_condition on the duration of the sonorant rime, as well as
Dialect × Sentence_condition and Gender × Sentence_condition
interactions.
Post-hoc tests show that WI RimeDuration is significantly longer than
SD (p<.001) and ZB (p<.05). Women have significantly longer sonorant rime
durations than men (on average, 210 ms vs 186 ms), p<.001. Finally, f-FR sonorant
rime durations are significantly longer than f-FALLS (225 vs 196 ms, p<.001).
nf-FALLS have the shortest sonorant rime (173 ms averaged over varieties), but
are not significantly different from other sentence conditions.
If we look at the sentence conditions separately, as in Table 4.6, we find main
effects of Dialect and Gender but no interaction for nf-​FALLS, f-​FALLS,
and f-​FR. Female speakers had significantly longer sonorant rime durations
than male speakers in all three sentence conditions. As for Dialect, post-​hoc
comparisons (Table 4.7) showed that the main effect of Dialect in nf-​FALLS
was due to the short rime durations in SD compared to the other varieties. In
f-​FALLS it was due to the difference between ZB and WI, and in f-​FR to the
difference between SD and WI.

Table 4.7 Pairwise comparisons for RimeDuration between levels of Dialect,
separately for each sentence condition

| Levels | nf-FALL | f-FALL | f-FR |
|---|---|---|---|
| SD vs. ZB | ** | | |
| SD vs. RO | * | | |
| SD vs. AM | ** | | |
| SD vs. GR | ** | | |
| SD vs. WI | *** | | * |
| ZB vs. RO | | | |
| ZB vs. AM | | | |
| ZB vs. GR | | | |
| ZB vs. WI | | * | |
| RO vs. AM | | | |
| RO vs. GR | | | |
| RO vs. WI | | | |
| AM vs. GR | | | |
| AM vs. WI | | | |
| GR vs. WI | | | |

4.3.2 Peak timing

In Hanssen (2017: 51), evidence is reported showing that Dutch dialect
speakers time the f0 peak in monosyllabic words such that a constant propor-
tion of the word is available for the realization of the falling pitch movement.
That is, the peak in shorter words was timed earlier when measured relative
to the beginning of the vowel and later when measured relative to the end of
the word, but no difference in proportional peak timing was found between
shorter and longer words.
The bars in Figure 4.3 suggest that (a) regional differences in propor-
tional peak timing do exist in our data (and are not a mere consequence of
differences in segmental duration) and that (b) regional variation in peak
timing would appear to interact with contour condition. In non-​final falls,
peak timing is early in ZB and WI; intermediate in RO, GR, and SD; and late
in AM. Specifically, in ZB and WI the peak falls from the end of the vowel; in
RO, GR, and SD it falls at a point just after the beginning of the coda; and in
AM it falls around the end of the accented syllable. In f-​FALLS, the earliest
peaks occur in ZB and the latest in GR. In f-​FR, RO has the latest peaks,
while AM has the earliest.
As Table 4.8 shows, we found main effects of Dialect and Sentence_condition
on proportional peak timing, and interactions between Dialect
× Sentence_condition, Gender × Sentence_condition, and Dialect ×
Gender × Sentence_condition.

Figure 4.3 Mean proportional peak timing in non-final falls, final falls, and final fall-rises
for each variety. Error bars represent ±2 standard errors of the mean

Table 4.8 Effect of Dialect, Gender, and Sentence_condition on H-RelTiming

| Effect | F | p |
|---|---|---|
| Dialect | F(5,108) = 2.91 | p < .05 |
| Gender | F(1,108) = 1.63 | n.s. |
| Sentence_condition | F(2,34) = 408.51 | p < .001 |
| Dialect × Sentence_condition | F(10,974) = 11.32 | p < .001 |
| Gender × Sentence_condition | F(2,982) = 4.19 | p < .05 |
| Dialect × Gender × Sentence_condition | F(10,972) = 1.92 | p < .05 |

Post-hoc tests show that the proportional peak in AM is timed significantly
later than in ZB (p<.01). They also show that all sentence conditions differ
significantly from one another (p<.001 for all levels of the comparison), with
estimated mean proportional peak timings of 63 percent (nf-​FALL), 25 per-
cent (f-​FALL) and 20 percent (f-​FR).
Looking at each sentence condition separately, we find a main effect of
Dialect [F(5,106) = 9.88, p<.001] and Gender [F(1,106) = 4.44, p<.05], no
interaction, in nf-​FALLS only. Female speakers time the location of the non-​
final nuclear peak at around 66 percent of the sonorant syllable, compared
to 60 percent for male speakers. Post-hoc tests for Dialect show that in
nf-FALLS, AM peaks are significantly later than all other varieties, with
mean differences ranging from 18 (AM-RO) to 29 percent (AM-WI).

Figure 4.4 Mean scaling in semitones of H and L in nf-FALLS (left-hand panel)
and f-FALLS (right-hand panel) for each variety. Error bars represent ±2
standard errors of the mean

4.3.3 Scaling of tonal targets

Next, we address differences in the scaling of individual pitch targets and
overall pitch level, as illustrated in Figure 4.4 for nf-​FALLS and f-​FALLS,
and in Figure 4.5 for f-FR. As observed in Section 4.2.4, converting from Hz to
semitones will not fully neutralize gender variation, for which reason we here
report on female speakers only. Scaling of the peak (H) and the following low
target (L2) can be analyzed across sentence conditions. Scaling of the final
high (H2) only occurs in fall-rises.

Figure 4.5 Mean scaling in semitones of H, L, and H2 in f-FR for each variety. Error
bars represent ±2 standard errors of the mean

For the scaling of the nuclear peak we found a main effect of Sentence_condition
[F(2,25) = 37.78, p<.001] and a Dialect * Sentence_condition
interaction [F(10,592) = 5.04, p<.001], but no main effect of Dialect.
Post-hoc tests show that all sentence conditions differ significantly from one
another (all pairs p<.001), with highest peaks in nf-FALLS (18.3 ST) followed
by f-FR (16.7 ST) and f-FALLS (15.9 ST). Although peaks are always highest
in non-final falls, they are not always lowest in final falls, which may have
caused the interaction.
As for scaling of the low target (L), we found a main effect of Dialect
[F(5,64) = 2.97, p<.05], of Sentence_​condition [F(2,13) = 152.40, p<.001],
and a Dialect * Sentence_​condition interaction [F(10,592) = 4.83,
p<.001]. Post-hoc tests revealed no significant differences in L-Scaling
between any of the dialects, but showed that L in fall-rises was significantly
higher (13.1 ST) than both non-final (10.3 ST) and final falls (9.6 ST) at
Separate analyses for final and non-final falls showed that H-Scaling
was significantly affected by Dialect in f-​FALLS [F(5,48) = 2.71, p<.05],
although none of the varieties differed significantly in post-​hoc tests. In f-​FR,
Dialect did not significantly affect H-​Scaling, but as Table 4.9 shows, we
did find a main effect of Dialect on L-​Scaling and H2-​Scaling.

Table 4.9 Effects of Dialect on the scaling of the nuclear peak, the elbow, and the
final high target in final fall-rises

| Variable | F | p |
|---|---|---|
| H1_Scaling | F(5, 59) = 1.33 | n.s. |
| L_Scaling | F(5, 59) = 5.17 | p < .001 |
| H2_Scaling | F(5, 59) = 3.37 | p < .01 |

Bonferroni post-​hoc comparisons showed that L was significantly higher
in ZB compared to AM (p<.05) and WI (p<.01), and significantly higher in
RO compared to WI (p<.05). Scaling of the second high-​target H2 was sig-
nificantly higher in SD compared to WI (p<.05) and RO compared to WI
In sum, even though Dialect did not have a systematic effect on the
nuclear peak, inspection of Figures 4.4 and 4.5 suggests that there are wider
falls in the northeast than in the southwest. Section 4.3.4 looks more specific-
ally at such differences in contour shape.

4.3.4 Contour shape: F0 duration, excursion, and slope

In this part of the results, we compare the shape of nuclear contours between
varieties by looking at f0 duration, f0 excursion, and slope of the pitch
movements. Note that these variables are not comparable between non-​final
and final falls on the one hand, and fall-​rises on the other. For falls, the variables
apply to the f0 stretch between the peak (H*) and the subsequent elbow (L),
disregarding the level or slowly falling pitch after the elbow. For fall-​rises, the
variables are measured separately for the falling (H* to L) and the rising (L to
H%) part of the pitch contour. We will therefore present the results for contour
shape separately for falls and for fall-rises.
Gender returns as an independent variable since differences in excursion are
effectively neutralized by measuring f0 in semitones.

Non-final and final falls

As the bar charts in Figure 4.6 show, f0 duration, f0 excursion, and f0 slope
vary with position. For all varieties, f0 duration is longer, f0 excursion larger,
and f0 slope is less steep in non-​final falls, compared to final falls. Indeed, we
find significant effects of Position for all three variables, along with signifi-
cant effects of Dialect and Dialect*Position interactions. The difference
between the two positions is larger in some varieties than in others (compare
RO and WI, for example). Amsterdam stands out in this respect, with particu-
larly long f0 durations and shallow slopes in nf-​FALLS. We will come back
to this result in the discussion. Below, we look at the effect of Dialect on
contour shape separately for non-final and final falls.

Figure 4.6 Mean f0 duration in ms (left panel), f0 excursion in ST (center panel) and f0 slope in ST/s (right panel) for non-final and final
falls, broken down by dialect. Error bars represent ±2 standard errors of the mean

Table 4.10 Effects of Dialect on FallDuration, FallExcursion, and
FallSlope in non-final falls and final falls

| Variable | Condition | F | p |
|---|---|---|---|
| FallDuration | nf-FALL | F(5,107) = 24.10 | p<.001 |
| | f-FALL | F(5,76) = 3.18 | p<.05 |
| FallExcursion | nf-FALL | F(5,103) = 6.51 | p<.001 |
| | f-FALL | F(5,76) = 4.52 | p<.001 |
| FallSlope | nf-FALL | F(5,105) = 8.45 | p<.001 |
| | f-FALL | F(5,77) = 5.19 | p<.001 |

In nf-FALLS, we found a significant main effect of Dialect on
FallDuration, FallExcursion, and FallSlope. Gender did not significantly
affect any of the dependent variables. The results are summarized
in Table 4.10. Bonferroni post-hoc comparisons for FallDuration
revealed that AM had significantly longer durations than all other varieties
at the p<.001 level. None of the other varieties differed significantly from
one another. For FallExcursion, post-hoc tests show that ZB excursion
is significantly smaller than all other varieties (ZB-SD and ZB-RO p<.05,
ZB-WI p<.01, and ZB-AM and ZB-GR p<.001). Finally, post-hoc tests for
FallSlope show that falls are significantly less steep in ZB compared to GR
(p<.001) and WI (p<.05), and less steep in AM compared to SD, RO, GR,
and WI (AM-SD p<.01, AM-RO p<.05, AM-GR and AM-WI p<.001).
Continuing with f-​FALLS, Table 4.10 shows main effects of Dialect
for all three variables. Additionally, we found an effect of Gender on
FallSlope in final falls [F(1,77) = 4.18, p<.05]. Bonferroni comparisons
for FallDuration revealed one significantly different dialect pair: SD-​WI
(p<.05). For both FallExcursion and FallSlope, we found that GR had
significantly larger and steeper falls than SD (p<.05 for both variables) and
ZB (p<.01 and p<.001, respectively).

Fall-rises

Regional differences in the shape of the fall-rise may be due to differences in
the shape of the falling movement (H* to L), of the final rise (L to H%), or
both. We therefore measured pitch movement duration, excursion, and slope
separately for the falling movement (FallDuration, FallExcursion,
and FallSlope) and the final rise (RiseDuration, RiseExcursion, and
RiseSlope). The bar charts in Figure 4.7 show that there are rather large
differences in the shape of the fall-​rise, with ZB, GR, and WI differing most
from the other varieties, each in its own way. For the falling movement, WI
has the longest duration, largest excursion, and steepest slope. ZB, on the
other hand, has the shortest duration, smallest excursion, and shallowest
slope for both the falling and the rising movement. In GR, the difference
between the falling and rising part is small in terms of duration, excursion,
and slope. These examples illustrate that regional patterns vary for the falling
movement, the rising movement, and the relation between the two.
We found a main effect of Dialect on FallDuration [F(5,83) = 2.66,
p<.05] and RiseSlope [F(5,83) = 2.94, p<.05]. We also found a main effect of
Gender on RiseDuration [F(1,85) = 4.69, p<.05] and a Gender*Dialect
interaction [F(1,84) = 2.42, p<.05] for FallExcursion.

Figure 4.7 Mean f0 duration in ms (left panel), f0 excursion in ST (center panel), and slope in ST/s (right panel) of the falling (FR1)
and rising (FR2) movements of final fall-rises. Error bars represent ±2 standard errors of the mean

Inspection of the data showed that female speakers had significantly longer
final rise durations than male speakers (88 vs 78 ms). Post-​hoc comparisons
for FallDuration showed that f0 duration of the falling movement was
significantly longer in WI compared to ZB and RO (both at p<.05). For
RiseSlope, no dialect pairs were significantly different in post-​hoc tests. The
interaction between Dialect and Gender for FallExcursion may have
been caused by the fact that female speakers had a larger excursion in some
(WI, GR) but not in other dialects (SD, ZB).
Figure 4.7 also suggests that regional differences exist in the way the fall
and rise of H*L H% relate to each other. In SD, ZB, RO, and AM, the falling
movement is shorter, smaller, and shallower than the rising movement. In
GR, the two movements are almost equal in duration, excursion, and slope,
whereas in WI, the pattern is the opposite of the southwestern and central
varieties, with the falling movement being longer, larger, and steeper than the
final rise.
Figure 4.8 shows the duration, f0-​excursion, and f0-​slope ratios between
the falling and rising movements in each dialect in order to further illustrate
the regional differences in the pronunciation of the IP-​final fall-​rise. Ratios
were calculated by dividing the value of the falling movement by that of the
rising movement for each speaker. A ratio smaller than 1 means that the mean
value for a variable of the falling movement is smaller than that of the rising
movement. For all three variables, the values for rises increase, quite clearly
so for the inherently related f0 excursion and f0 slope. Statistical analyses
revealed a significant main effect of Dialect for duration [F(5,86) = 2.67,
p<.05] and f0-​excursion [F(5,83) = 2.84, p<.05]. Post-​hoc tests for the dur-
ation ratio did not reveal significantly different dialect pairs. For the excursion
ratio, they showed that WI is significantly different from SD (p<.05).
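
The per-speaker ratio calculation described above can be sketched as follows. The speaker records and values are hypothetical and only illustrate how a ratio below 1 (fall shorter than rise) and a ratio above 1 (fall longer than rise) arise; durations are used so that both values are positive.

```python
from statistics import mean

def fall_rise_ratio(fall_value, rise_value):
    """Divide the value for the falling movement by that for the rising movement."""
    if rise_value == 0:
        raise ValueError("rise value must be non-zero")
    return fall_value / rise_value

# Hypothetical records: (speaker, dialect, FallDuration in ms, RiseDuration in ms).
speakers = [
    ("s1", "ZB", 60.0, 80.0),
    ("s2", "ZB", 66.0, 88.0),
    ("s3", "WI", 99.0, 90.0),
    ("s4", "WI", 110.0, 100.0),
]

# Compute one ratio per speaker, then average the ratios within each dialect.
ratios_by_dialect = {}
for _speaker, dialect, fall, rise in speakers:
    ratios_by_dialect.setdefault(dialect, []).append(fall_rise_ratio(fall, rise))

mean_ratio = {dialect: mean(ratios) for dialect, ratios in ratios_by_dialect.items()}
print(mean_ratio)
```

Computing the ratio per speaker before averaging keeps each speaker's fall and rise paired, rather than dividing one group mean by another.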

4.4 Summary and discussion

Earlier investigations of regional differences in the realization of intonation
have mainly reported variation in tonal timing. Our more comprehensive
study included other variables, such as segmental and contour duration,
scaling, and contour shape. We will sum up the main results, paying attention
to the effect of gender (4.4.1), sentence condition (4.4.2), and dialect (4.4.3).
We will compare our findings to the results reported in Peters et al. (2014) to
see if we replicate their finding of a geographical cline in the realization of
non-​final and IP-​final falling pitch accents as well as IP-​final falling-​rising
pitch accents.

4.4.1 Effects of gender

Apart from the inherent effect on scaling, the most systematic effect of gender
was found for segmental duration. Rime durations were longer for women
across sentence conditions. Women also timed their peaks later than men, most
notably in non-final falls, which may partially be due to the larger numbers
of women in the GR and WI groups. In final falls, male speakers showed
a steeper falling slope, though not consistently across all dialects. Finally,
women produced longer, and in some dialects also larger, final rises in H*L
H% nuclear accents. These features might be explained as an enhanced use of
the high-pitched end of Ohala’s (1983) Frequency Code (Gussenhoven 2016).

Figure 4.8 Duration ratio, excursion ratio, and slope ratio between the falling and
rising movement of f-FR

4.4.2 Effects of sentence condition

The three sentence conditions varied systematically with respect to sonorant
rime duration and timing of the nuclear peak. Rime durations were shorter,
and peak timing earlier, in non-final falls compared to final falls, and in final
falls compared to final fall-​rises. Non-​final peaks were also scaled higher.
These effects can be interpreted as responses to time pressure. Speakers can
increase the duration of segmental material, and retract and lower tonal
targets to create more space for the realization of contours in case of an
upcoming IP-​final boundary or in case of more complex intonation contours
(e.g., Steele 1986; Wightman et al. 1992; Prieto et al. 1995; Grabe 1998). Final
and non-​final falls could additionally be compared in terms of shape (e.g.,
Grabe et al. 2000). Final falls were realized with shorter f0 durations, shorter
excursions, and hence steeper falling slopes, all of which can also be attributed
to the lack of space to realize the contours.
The realization of nuclear contour types was not affected uniformly
across the dialects. First, the difference in sonorant rime duration between
the two falling contours is much smaller in Zuid-​Beveland than in other var-
ieties, particularly Winschoten. Secondly, the difference between non-​final
and final proportional peak timing in falls is much smaller in Winschoten
than in Standard Dutch, and particularly Amsterdam (see Section 4.4.3).
In fall-​rises, the difference in timing between final falls and fall-​rises in
Zuid-​Beveland is particularly small compared to Grou. Lastly, while final
falls had a steeper falling slope in all varieties, differences could be observed
in the sources that governed the slope. For example, excursion sizes varied
much less between non-​final and final falls in Grou and Winschoten, while
fall duration was reduced. In Standard Dutch and Rotterdam, steeper
slopes were also caused by shorter durations, while excursions were reduced
at the same time.
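The two sources of a steeper slope mentioned above can be made concrete with a small worked example. It is only a sketch: the function name, the units, and every number are invented for illustration and are not measurements from the study.

```python
def slope(excursion_st: float, duration_ms: float) -> float:
    """Mean slope of a pitch movement: excursion divided by duration."""
    return excursion_st / duration_ms


# All values below are hypothetical.
non_final = slope(excursion_st=6.0, duration_ms=200.0)

# Source 1: excursion roughly preserved, duration reduced
# (the pattern described for Grou and Winschoten).
final_duration_only = slope(excursion_st=6.0, duration_ms=150.0)

# Source 2: excursion and duration both reduced, duration proportionally more
# (the pattern described for Standard Dutch and Rotterdam).
final_both_reduced = slope(excursion_st=5.0, duration_ms=100.0)

# A smaller excursion steepens the slope only if duration shrinks faster;
# here the excursion is halved while duration shrinks by a quarter.
shallower_case = slope(excursion_st=3.0, duration_ms=150.0)

print(non_final, final_duration_only, final_both_reduced, shallower_case)
```

Both hypothetical final falls come out steeper than the non-final fall, while the last case shows that a reduced excursion alone does not guarantee a steeper slope.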

4.4.3 Effects of dialect

Duration

Sonorant rime durations gradually increased from the southwest (ZB) to the
northeast (GR, WI), showing a weak geographical component. ZB generally
had the shortest durations, and WI the longest, matching the first half of
the inverted U-​shape reported in Peters et al. (2014). The short segmental
durations for ZB and the long ones in GR are in agreement with that study,
which investigated the effects of focus condition on the realization of non-​
final declarative falls in varieties of Dutch (but which did not look at Standard
Dutch).
Recall from Section 4.2.3 that speakers from WI often pronounced the
target words, for example, “Manderen” as a disyllabic word, [mɑndə(ː)n]
instead of [mɑndərə]. This reduction might partly explain the long rime
durations, since the fewer unstressed syllables that occur after the main stress
in a word, the longer its stressed syllable (Nooteboom 1972; Rietveld et al.
2004). However, because in the case of final falls and final fall-​rises seg-
mental duration was always longest in WI for identically pronounced target
words, speakers of WI may safely be said to have longer segmental durations
than speakers of other varieties. Verhoeven et al. (2004) similarly reported
that speech tempo was slowest for speakers from the northern periphery of
the Dutch-​speaking area. However, their finding that this also goes for the
southern periphery is not confirmed in the ZB data, the southernmost dialect
in our investigation, though not in the Dutch language area.5

Peak timing

Dialect had an effect on proportional peak timing. Details differed with
contour condition. In non-​final falls, proportional peak timing tended to be
earliest in ZB and WI and latest in AM, a geographical pattern that closely
resembles the results in Peters et al. (2014). The other contour classes showed
a different pattern, although ZB always belonged to the early-​peak group. In
WI, peaks were late in the two final conditions, but early in non-​final falls.
Since the duration of the final accented syllable rime was particularly long in
WI, the late timing of the final peaks can be attributed to the generous avail-
ability of sonorant segments.
Its late-​peak accents make AM stand out from the other varieties. Speakers
produced a combination of normal-​peak and late-​peak falling accents, with
or without a final boundary tone for non-​final falls. Between-​speaker and
within-speaker variation was considerable, and we could not always confidently
label the pitch accent as either a late or a regular fall.

Scaling

Across sentence conditions, Dialect did not systematically affect the scaling
of nuclear peaks or overall pitch level, as was found between the standard
varieties of Dutch in the Netherlands and Belgium by van Bezooijen (1993).
The largest effect of Dialect on scaling was found in the fall-​rises, where
the valley between the two high tones was realized at much higher f0 in ZB
and RO than in AM and WI. The final high tone was scaled highest in WI.
Differences in the depth of the valley and height of the final high tone have
consequences for the excursion sizes of the falling and rising movements of
the fall-rise, as Figure 4.7 suggests.

Contour shape

Starting with the effect of Dialect on the shape of nuclear falling melodies,
we have seen that the fall (from nuclear peak to subsequent elbow) was par-
ticularly long and shallow in Amsterdam. This follows directly from the
presence of late-​peak, slowly falling nuclear accents in that variety. Moreover,
southwestern ZB falls tend to be small and shallow, contrasting with the large
and steep falls in GR in the northeast. Comparing our non-​final data with
the results for contour shape in Peters et al. (2014), we also found an inverted
U-shape for f0 excursion. The U-shape for f0 slope in Peters et al. (2014) is not
replicated in our findings, because fall slope for ZB is shallow in our results,
but steep in theirs. Apart from that difference, the dialects show similar
behavior in the two studies. Figure 9 provides schematic representations of
the shape of non-final and final falls.

Figure 4.9 Schematic representation of fall-rise types in peripheral (Zuid-Beveland
and Winschoten) and central varieties (panels: ZB, Central, WI)

The fall-rise was rather similar in shape in SD, RO, AM, and GR (see
Figure 4.7), which we call the “central” realization. ZB and WI deviated
from it in their own ways. Speakers of WI realized it with longer, larger, and
steeper falling movements and shorter and shallower final rises than speakers
of the other varieties. As a result, the ratio between the falling and rising
movements in WI differed substantially from that in the other varieties, in
which the shapes of the falling and rising movements were comparable (GR)
or the fall was short, small, and shallow compared to the final rise. The
stylized contours in Figure 4.9 show the distinct shape of the fall-​rise in WI
in comparison with the “central” version and ZB. As shown in Figure 4.10,
the shape of the ZB fall-​rise is characterized by a shallow dip between the two
high peaks. In Hanssen (2017: 149), it is shown that speakers of ZB often do
not produce such a dip at all. These extremely shallow realizations by speakers
of Zuid-​Beveland represent a context-​specific response to time pressure that is
absent in the other varieties.
Significantly, the most extreme realizational variation could be observed in
the two geographically most extreme dialects, Zuid-​Beveland and Winschoten.
They differed most from each other as well as from the other varieties. In fact,
if we look at the significantly different dialect pairs, 67 percent of them (29
out of 43) involve ZB, WI, or both. In only 14 out of 43 cases are dialects
other than ZB and WI involved in the comparison. Thus, Zuid-Beveland and
Winschoten represent two extreme ends of a scale with respect to segmental
duration, and excursion and shape of the pitch movements, with Standard
Dutch, Rotterdam, Amsterdam, and Grou generally in between. Interestingly,
the variety of Weener Low Saxon can often be placed geographically after
Winschoten, logically finishing the inverted U-​shape (Peters et al. 2014, 2015).
Except for variables related to the AM late fall, dependent variable means
in SD, RO, and AM often resembled one another. This reflects the fact that the
standard variety has its roots in the central western varieties of the Netherlands.
In fact, for many variables, we could observe a tendency for a gradual shift in
the mean from the southwestern ZB via the central varieties of RO, AM, and
SD, and on to the northeastern varieties of GR and WI. Such intonational
geographical clines were first reported in Peters et al. (2014, 2015) for Dutch,
Frisian, Low Saxon, and High German intonation. While these clines are commonplace
in segmental sociolinguistic research, modulo social and natural boundaries
(Britain 2013), their documentation is new in the field of intonation.

Notes

1 The differences in tonal timing between the Lowland Scots varieties of Orkney
and Shetland have an additional effect on syllable duration, which is longer for the
Shetland variety. Van Leyden (2004: 69) attributes the longer syllable duration to
the fact that in Shetland, the entire rising movement is realized on the accented
syllable, while in Orkney the peak of the rise is not realized until after the accented
syllable.
2 In addition, Ulbrich found differences in overall speech rate between the two
varieties, which were caused by the number and duration of pauses within sentences.
3 Speakers of Standard Dutch produced three sentences each, with the target words
Manderen, Bunderen, and Lunteren.
4 This point roughly corresponds to the location in the f0 curve where the first deriva-
tive (or slope) is zero. As such, our method is the manual version of the MOMEL
algorithm used as input for the INTSINT transcription system for the representa-
tion of intonation (Hirst 2005).
5 Verhoeven et al.’s work was criticized for methodological errors in Quené (2008),
who nevertheless reached the same conclusion regarding differences in speech
tempo between speakers from Flanders and the Netherlands, and between male
and female speakers.

Appendix

A context sentence
B carrier sentence, target word in bold (only for Dutch)
i Dutch (Zuid-​Beveland, Rotterdam, Amsterdam)
ii West Frisian (Grou)
iii Low Saxon (Winschoten)
iv English gloss
v English translation
vi Standard Dutch pilot

NB. Where materials differ between the varieties, separate English glosses and
translations are provided for each, indicated by slashes (/​).

Non-​final falls (4)

Sentence 1
A i Waar zouden jullie willen blijven?
ii Wêr soenen jimme bliuwe wolle?
iii Woar zollen ie langes willen?
iv where would you want stay /​stay want //​
where would you by want
v Where would you like to stay? //​
Where would you like to pass by?

B i We zouden bij Munderen willen blijven.
ii Wy soenen by Munderen bliuwe wolle.
iii Wie zollen bie Munderen langes willen.
iv we would at munderen want stay /​stay want //​
we would at munderen by want
v We’d like to stay at Munderen //​
We’d like to pass by Munderen.

Sentence 2
A i Waar zouden de Janssens heen willen lopen?
ii Wêr soenen de Janssens hinne rinne wolle?
iii Woar zollen de Janssens hinlopen willen?
iv where would the johnsons to want walk /​walk want
v Whereto would the Johnsons like to walk?

B i Ze zouden naar Memberen willen lopen.
ii Se soenen nei Memberen rinne wolle.
iii Zai zollen noar Memberen lopen willen.
iv they would to Memberen want walk /​walk want
v They’d like to walk to Memberen.

Sentence 3
A i Waar zou Karel je heen willen brengen?
ii Wêr soe Karel dy hinne bringe wolle?
iii Woar zol Karel die hinbringen willen?
iv where would Karel you to want bring /​bring want
v Where did Karel want to take you to?

B i Hij zou me naar Momberen willen brengen.
ii Hy soe my nei Momberen bringe wolle.
iii Hai zol mie noar Momberen bringen willen.
iv he would me to Momberen want bring /​bring want
v He wanted to take me to Momberen.

Sentence 4
A i Waar zouden je oom en tante willen wonen?
ii Wêr soenen jim omke en tante wenje wolle?
iii Woar zollen dien oom en tante wonen willen?
iv where would your uncle and aunt want live /​live want
v Where would your uncle and aunt like to live?

B i Ze zouden bij Manderen willen wonen.
ii Se soenen by Manderen wenje wolle.
iii Ze zollen bie Manderen wonen willen.
iv they would at Manderen want live /​live want
v They’d like to live near Manderen.
vi Standard Dutch (pilot), n=3
1A Wat wil de familie Mols morgen doen?
B Ze zouden naar Bunderen willen fietsen.
2A En, zijn jullie van ’t weekend nog weggeweest?
B Ja, we waren in Lunteren gaan logeren.
3A Wat is er met jullie?
B We willen in Manderen blijven wonen.

Final falls (4)

Sentence 1
A i Met wie gaat Roel naar ’t concert?
ii Mei wa sil Roel nei ’t konsert?
iii Mit wel gaait Roel noar ’t concert?
iv with who goes Roel to the concert
v With whom is Roel going to the concert?

B i Hij gaat met Marjel de Lof.
ii Hy sil mei Marjel de Lof.
iii Hai gaait mit Marjel de Lof.
iv he goes with Marjel de Lof
v He’s going with Marjel de Lof.

Sentence 2
A i Van wie is dat dikke boek?
ii Fan wa is dat tsjûke boek?
iii Van wel is dat dikke bouk?
iv of who is that big book
v Whose big book is that?

B i Dat is van Professor Loof.
ii Dat is fan Professor Loof.
iii Dat is van Professor Loof.
iv that is of Professor Loof
v It belongs to Professor Loof.

Sentence 3
A i Met wie gingen de kinderen naar de dierentuin?
ii Mei wa gienen de bern nei de dieretún?
iii Mit wel gingen de kiender noar de daierntoene?
iv with who went the children to the zoo
v Who did the children go to the zoo with?

B i Ze gingen met meester Lom.
ii Se gienen mei meester Lom.
iii Zai gingen mit meester Lom.
iv they went with mister Lom
v They went with Mr Lom.

Sentence 4
A i Met wie gaat je baas morgen trouwen?
ii Mei wa sil dyn baas moarn trouwe?
iii Mit wel gaait dien boas mörgen traauwen?
iv with who goes your boss tomorrow marry
v Who will your boss marry tomorrow?

B i Hij trouwt met mevrouw de Loom.
ii Hy trout mei mefrou de Loom.
iii Hai traauwt mit vraauw de Loom.
iv he marries with Mrs de Loom
v He’ll marry Mrs de Loom.
vi Standard Dutch (pilot), n=4
1A Met wie gaat Roel naar ’t concert?
B Hij gaat met meneer ’t Lof.
2A Van wie is dat dikke boek?
B Dat is van Professor Loof.
3A Met wie gingen de kinderen naar de dierentuin?
B Ze gingen met meester Lom.
4A Met wie gaat je baas morgen trouwen?
B Hij trouwt met mevrouw de Loom.

Final fall-​rises (4)

Sentence 1
A i Ga je mee naar Bakkerij ’t Stoepje?
ii Giest do mei nei Bakkerij ’t Stoepje?
iii Gaaist doe mit noar Bakkerij ’t Stoepje?
iv go you along to bakery ’t Stoepje?
v Do you want to come along to Bakery ’t Stoepje?

B i We gaan toch naar Bakker Lof?
ii Wy gean dochs nei Bakker Lof?
iii Wie goan toch noar Bakker Lof?
iv we go actually to baker Lof?
v But aren’t we going to the Lof Bakery?

Sentence 2
A i Meester Boelens gaat mee op schoolreis.
ii Master Boelens sil mei op skoalreis.
iii Meester Boelens gaait mit op schoulraaise.
iv Mr Boelens goes along on schooltrip
v Mister Boelens is coming along on the school trip.

B i Je ging toch met meester Loof?
ii Do giest dochs mei master Loof?
iii Doe gingst toch mit meester Loof?
iv you went actually with mister Loof?
v But weren’t you going with Mr Loof?

Sentence 3
A i Pepijn de Heer komt straks ook naar ’t feest.
ii Pepijn de Heer komt strak ek nei ’t feest.
iii Pepijn de Heer komt straks ook noar ’t feest.
iv Pepijn de Heer comes later also to the party
v Pepijn de Heer is also coming to the party later.

B i Hij heet toch Pepijn de Lom?
ii Hy hjit dochs Pepijn de Lom?
iii Hai hait toch Pepijn de Lom?
iv he calls actually Pepijn de Lom?
v But isn’t he called Pepijn de Lom?

Sentence 4
A i Dit antieke horloge is nog van opa Thijssen geweest.
ii Dit antike horloazje hat noch fan pake Thijssen west.
iii Dit antieke hallozie is nog van opa Thijssen west.
iv this antique wristwatch is even of grandfather Thijssen been
v This antique wristwatch used to belong to grandfather Thijssen.

B i Het was toch van opa Loom?
ii It wie dochs fan pake Loom?
iii Het was toch van opa Loom?
iv it was actually from grandfather Loom?
v But didn’t it belong to grandfather Loom?
vi Standard Dutch, n=4
1A Ga je mee naar Café de Engel?
B Je wilde toch naar ’t Lof?
2A Meester Boelens gaat mee op schoolreis.
B Je ging toch met Mr Loof?
3A Mies de Heer komt straks ook naar ’t feest.
B Ze heette toch Mies de Lom?
4A Dit antieke horloge is nog van opa Thijssen geweest.
B Het was toch van opa Loom?

References

Arvaniti, A., and Garding, G. (2007) “Dialectal variation in the rising accents of
American English” in Hualde, J., and Cole, J. (eds.) Papers in laboratory phonology
9. Berlin: Mouton de Gruyter, pp. 547–576.
Atterer, M., and Ladd, D. R. (2004) “On the phonetics and phonology of ‘segmental
anchoring’ of F0: Evidence from German”, Journal of Phonetics, 32, pp. 177–197.
Boersma, P., and Weenink, D. (2008) Praat: Doing phonetics by computer (Version
5.0.25) [computer program]. Retrieved 31 May 2008 from​
Britain, D. (2013) “Space, diffusion and mobility” in Chambers, J. K., and Schilling-​
Estes, N. (eds.) Handbook of language variation and change, 2nd edition. Hoboken,
NJ: Wiley-​Blackwell, pp. 471–​500.
Daan, J. (1938) “Dialect and pitch pattern of the sentence” in Blancquaert, E., and
Pée, W. (eds.) Proceedings of the Third International Congress of Phonetic Sciences.
Ghent: Laboratory of the Phonetics of the University, pp. 473–​480.
Del Giudice, A., Shosted, R., Davidson, K., Salihie, M., and Arvaniti, A.
(2007) “Comparing methods for locating pitch ‘elbows’” in Trouvain, J. (ed.)
Proceedings of the 16th International Congress of Phonetic Sciences (ICPhS).
Saarbrücken: Universität des Saarlandes, pp. 1117–​1120.
Gilles, P. (2005) Regionale Prosodie im Deutschen: Variabilität in der Intonation von
Abschluss und Weiterweisung. Berlin: Walter de Gruyter.
Grabe, E. (1998) “Pitch accent realization in English and German”, Journal of
Phonetics, 26, pp. 129–​144.
Grabe, E., Post, B., Nolan, F., and Farrar, K. (2000) “Pitch accent realization in four
varieties of British English”, Journal of Phonetics, 28, pp. 161–​185.
Gussenhoven, C. (2016) “Foundations of intonational meaning: Anatomical and
physiological factors”, Topics in Cognitive Science, 8, pp. 425–​434.
Gussenhoven, C., and Rietveld, A. C. M. (1992) “Intonation contours, prosodic
structure, and preboundary lengthening”, Journal of Phonetics, 20, pp. 283–303.
Gussenhoven, C., and van der Vliet, P. (1999) “The phonology of tone and intonation
in the Dutch dialect of Venlo”, Journal of Linguistics, 35, pp. 99–​135.
Hanssen, J. (2017) Regional variation in the realization of intonation contours in the
Netherlands. Utrecht: LOT Publications.
Hanssen, J., Peters, J., and Gussenhoven, C. (2016) “Phonetic effects of focus in five
varieties of Dutch” in Barnes, J., Brugos, A., and Shattuck-​Hufnagel, S. (eds.)
Speech Prosody 2016. Boston: International Speech Communication Association
(ISCA), pp. 736–​740.
Hirst, D. J. (2005) “Form and function in the representation of speech prosody”,
Speech Communication, 46, pp. 334–347.
Kalaldeh, R., Dorn, A., and Ní Chasaide, A. (2009) “Tonal alignment in three var-
ieties of Hiberno-​English” in International Speech Communication Association
(ed.) Proceedings of Interspeech 2009. Brighton: ISCA, pp. 2443–​2446.
Kügler, F. (2007) The intonational phonology of Swabian and Upper Saxon.
Tübingen: Max Niemeyer Verlag.
Ladd, D. R. (2008) Intonational phonology, 2nd edition. Cambridge: Cambridge
University Press.
Ladd, D. R., Schepman, A., White, L., Quarmby, L. M., and Stackhouse, R. (2009)
“Structural and dialectal effects of pitch peak alignment in two varieties of British
English”, Journal of Phonetics, 37, pp. 145–​161.
Mücke, D., Grice, M., Becker, J., and Hermes, A. (2009) “Sources of variation in tonal
alignment: Evidence from acoustic and kinematic data”, Journal of Phonetics, 37,
pp. 321–​338.
Nooteboom, S. G. (1972) Production and perception of vowel duration: A study of
durational properties of vowels in Dutch. Ph.D. Diss., University of Utrecht.
Ohala, J. (1983) “Cross-language use of pitch: An ethological view”, Phonetica,
40, pp. 1–18.
Peters, J. (1999) “The timing of nuclear high accents in German dialects” in Ohala, J.,
Hasegawa, Y., Ohala, M., Granville, D., and Bailey, A. C. (eds.), Proceedings of the
14th International Congress of Phonetic Sciences (ICPhS). San Francisco: Regents
of the University of California, pp. 1877–​1880.
Peters, J. (2006) Intonation deutscher Regionalsprachen. Berlin: de Gruyter.
Peters, J., Hanssen, J., and Gussenhoven, C. (2014) “The phonetic realization of focus
in West Frisian, Low Saxon, High German, and three varieties of Dutch”, Journal
of Phonetics, 46, pp. 185–​209.
Peters, J., Hanssen, J., and Gussenhoven, C. (2015) “The timing of nuclear
falls: Evidence from Dutch, West Frisian, Dutch Low Saxon, German Low Saxon,
and High German”, Laboratory Phonology, 6, pp. 1–​52.
Petrone, C., and D’Imperio, M. (2009) “Is tonal alignment interpretation independent
of methodology?” in International Speech Communication Association (ed.)
Proceedings of Interspeech 2009. Brighton: ISCA, pp. 2459–​2462.
Prieto, P., van Santen, J., and Hirschberg, J. (1995) “Tonal alignment patterns in
Spanish”, Journal of Phonetics, 23, pp. 429–​451.
Quené, H. (2008) “Andante of allegro? Verschillen in spreektempo tussen Vlamingen
en Nederlanders”, Onze Taal, 77, pp. 179–​181.
Rietveld, T., Kerkhoff, J., and Gussenhoven, C. (2004) “Word prosodic structure and
vowel duration in Dutch”, Journal of Phonetics, 32, pp. 349–​371.
Smakman, D. (2006) Standard Dutch in the Netherlands: A sociolinguistic and phonetic
description. Utrecht: LOT Publications.
Steele, S. (1986) “Nuclear accent f0 peak location: Effect of rate, vowel, and number
of syllables”, Journal of the Acoustical Society of America, 80 (Suppl. 1), p. s51.
Turk, A., Nakai, S., and Sugahara, M. (2006) “Acoustic segment durations in prosodic
research: A practical guide” in Sudhoff, S., Lenertová, D., Meyer, R., Pappert, S.,
Augurzky, P., Mleinek, I., Richter, N., and Schliesser, J. (eds.) Methods in empir-
ical prosody research. Language, Context, and Cognition, 3. Berlin; New York: De
Gruyter, pp. 1–​28.
Ulbrich, C. (2005) Phonetische Untersuchungen zur Prosodie der Standardvarietäten
des Deutschen in der Bundesrepublik Deutschland, in der Schweiz und in Österreich.
Frankfurt am Main: Peter Lang.
van Bezooijen, R. (1993) “Verschillen in toonhoogte: Natuur of cultuur?” Gramma/​
TTT, 2, pp. 165–​179.
van Es, G. A. (1935) “Syntactische functies der intonatie in de volkstaal onzer
noordelijke provinciën” in Handelingen van het Zestiende Nederlandsche Philologen-​
Congres. Groningen: Het nederlands Philologencongres, pp. 39–42.
van Leyden, K. (2004) Prosodic characteristics of Orkney and Shetland Dialects: An
experimental approach. Ph.D. Diss., Leiden University.
Verhoeven, J., de Pauw, G., and Kloots, H. (2004) “Speech rate in a pluricentric lan-
guage situation: A comparison between Dutch in Belgium and the Netherlands”,
Language and Speech, 47, pp. 299–​310.
Weijnen, A. (1966) Nederlandse dialectkunde. Assen: Van Gorcum.
Wightman, C. W., Shattuck-Hufnagel, S., Ostendorf, M., and Price, P. J. (1992)
“Segmental durations in the vicinity of the prosodic phrase”, Journal of the
Acoustical Society of America, 91, pp. 1707–​1717.
Xu, R. (2003) “Measuring explained variation in linear mixed effects models”,
Statistics in Medicine, 22, pp. 3527–​3541.

5 A prosodic essence conjecture1

Lian-​Hee Wee

5.1 Introduction
Conventional linguistic wisdom has it that languages are tonal, stress-​accented,
or pitch-​accented, although Hyman (2006, 2009) has argued rather convin-
cingly that pitch accent is not a type,2 but a language may “pick and choose
properties from the tone and stress prototypes” (Hyman 2009: abstract).
Common definitions of tone language are as given in (1), so that presumably,
something that is not a tone language is a stress language and would have
properties laid out in (2).

(1) Definition of Tone Language

a. Yip (2002:1)
A language is a tone language if the pitch of a word can change the
meaning of the word.
b. Hyman (2001)
A tone language is one in which an indication of pitch enters into the
lexical realization of at least some morphemes.

(2) Definition of a language with Stress (Hyman 2009: (7))

A language has stress if it requires
a. Obligatoriness: every lexical word has AT LEAST one syllable
marked for the highest degree of metrical prominence (primary
stress); and
b. Culminativity: every lexical word has AT MOST one syllable
marked for the highest degree of metrical prominence
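The two conditions in (2) can be read as a pair of checks over a lexicon. The sketch below is only an illustration of that reading: the integer encoding of stress levels (1 = primary, 2 = secondary, 0 = unstressed), the function names, and the toy lexicons are assumptions introduced here, not part of Hyman's definition.

```python
from typing import Dict, List

PRIMARY = 1  # assumed code for primary stress; 2 = secondary, 0 = unstressed


def primary_count(stress_pattern: List[int]) -> int:
    """Number of syllables carrying primary stress in one word."""
    return sum(1 for level in stress_pattern if level == PRIMARY)


def is_obligatory(lexicon: Dict[str, List[int]]) -> bool:
    """(2a): every lexical word has AT LEAST one primary-stressed syllable."""
    return all(primary_count(p) >= 1 for p in lexicon.values())


def is_culminative(lexicon: Dict[str, List[int]]) -> bool:
    """(2b): every lexical word has AT MOST one primary-stressed syllable."""
    return all(primary_count(p) <= 1 for p in lexicon.values())


def has_stress(lexicon: Dict[str, List[int]]) -> bool:
    """A language counts as having stress only if both conditions hold."""
    return is_obligatory(lexicon) and is_culminative(lexicon)


# Toy sample of English words, one stress level per syllable.
english_sample = {
    "banana": [0, 1, 0],
    "Canada": [1, 0, 0],
    "assimilation": [0, 2, 0, 1, 0],
}

# Hypothetical lexicons that violate one condition each.
no_primary = {"word": [0, 0]}
two_primaries = {"word": [1, 1]}

print(has_stress(english_sample), has_stress(no_primary), has_stress(two_primaries))
```

Taken together, the two checks amount to requiring exactly one primary stress per lexical word; the two hypothetical lexicons show how each condition can fail independently.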

The prima facie differences in the definitions in (1) and (2) are in fact not
straightforward when checked against the reality of languages. Starting with
(1a), the definition conjures impressions of minimal pairs involving tone, but
that depends on the interpretation of “meaning”. There are instances where
tone is used to signal different syntactic categorization of what might have
roughly the same semantic core, for example, to clothe and clothes in Standard
Cantonese would both be [ji] but mid tone for the verb and high tone for the
noun. Somali uses tone to indicate grammatical case, distinguishing the voca-
tive, genitive, and absolutive cases by the position of the high tone (Hyman
1987; Banti 1988, and more recently on grammatical tones Hyman and Leben
(forthcoming)). Stress is known to do the same. At the same time, pitch may
signal assertions or queries, as in English Yes versus yes? Thus (1a) does not
really tell us if a language is tonal unless we know what qualifies as instances
of pitch that change (word) meanings. The definition in (1b) makes no reference
to meaning, but one is not quite sure how to interpret “lexical realization”.
Consider functional words. In English, hmm with a low falling pitch
is affirmation, with a high falling pitch it indicates doubt or surprise, and
with a mid flat pitch it indicates hesitation. In Ali (Bradshaw 1998) there is a
floating tone that raises tones one level up (i.e., low tones become mid and mid
tones become high), and there are cases where downsteps are triggered by floating
low tones, as seen in many African languages like Bambara, Ga, Ngizim, Tiriki,
Twi, and more (see Odden (forthcoming) for an excellent overview). With
(1b), one needs to determine if a particular instance counts as a lexical real-
ization through an indication of pitch. Presumably, the indication does not
refer to writing or to linguists’ transcriptions, but to what must be part of the
speakers’ phonological grammar.
Moving on to (2), “obligatoriness” is also found in tone languages, so
toneless syllables do not qualify as prosodic words. “Culminativity” is more
subtle, but there are languages like Shanghai, and other Northern Wu
languages, where apparently only the tone of the initial syllable is preserved
and other tones are deleted when syllables combine to make polysyllabic
words. It is probably not coincidental that Shanghai is left-prominent
prosodically. Even for languages like Kukuya (Hyman 1987), Mende (Leben
1973, 2017) and Margi (Hoffman 1963; Pulleyblank 1983), words really have
only one tone (contour) that is then shared by the number of syllables there
are, including suffixes not specified for tone. One needs to ask if a contour
can count as one tone or many, since contour tones can and do function as a
unit in some languages, such as Tianjin (Yip 1989), where OCP applies to the
contour.3 This is discussed further in Section 5.3.
The difficulty in providing a definition or even a set of diagnostics for what
counts as a tone language underlines the need for a different perspective on
the issue.
A related curiosity is that stress often appears elusive in tone languages.
For example, even though Kera (a Chadic tone language) has clear evidence
for footing through syllable weight, “overt stress does not play a major role”
(Pearce 2013: 19). It is so difficult to figure out stress for tone languages that
even highly acclaimed analysts have to go to great lengths in support of their
positions and even then may end up widely disagreeing. Standard Chinese,
for example, has been variously argued to have initial stress (Duanmu 2000/​
2007),4 final stress (Chao 1968; Wee 2008a), or no stress (Gao and Shi 1963).
Mirroring that, stress-​accented languages such as English have speakers who
find tones elusive, even though English speech clearly involves systematic uses
of pitch. Perhaps what is needed is a different way of conceptualizing what
has been conventionally accepted as tone language or stress (accented) lan-
guage. It is this general direction that this chapter seeks to explore.
Section 5.2 and Section 5.3 of this chapter present the similarities and
differences between tone and stress. Section 5.2 discusses arguments for why
it is untenable to think of tone and stress as two different animals. They are
more like two sides of the same coin. As a counterweight, Section 5.3 explains
the difficulties that underlie any attempt to provide a unified understanding.
A conjecture is proposed in Section 5.4 that appeals to two parametric features
of [pitch] and [contour] to capture the typologies between stress and tones.
This is followed by a conclusion.

5.2 Stress and tone: Not so different

The default position of non-​linguists and linguists alike seems to be that
tone and stress are different (though Duanmu 2002 is a notable exception),
even though few would challenge the idea that languages, whether tonal
or stress-accented, may have prosody. Thus tone languages may identify the
locus of tonal alternations by figuring out which syllable is (not) the pros-
odic head; the treatment for vowel alternations in stress languages is similar.
This section examines three areas and raises the question of whether tone and
stress are indeed different.

5.2.1 Phonetic correlates

The typical set of phonetic correlates for stress would include Length (dur-
ation), Intensity (amplitude), and Pitch (fundamental frequency), which
I acronymize as LIP. Languages differ in how LIP is used or combined. In
English stress, pitch marking varies according to intonation. In a word like
assimilation, a high tone H* marks the stressed syllable -la- in a declarative
context, whereas a low tone L* would be used if the word were uttered as an
interrogative. Syllable
length, vowel quality, and volume are also systematically used, as can be
observed in examples involving vowel and/​or diphthong reduction.
However, tones are manifested as LIP also, though pitch appears to be most
dominant in our perception. When speakers of tone languages whisper, voi-
cing is suppressed and all syllables are then non-​musical. Do speakers continue
to perceive tones? If so, what are the acoustic cues? It is easy to factor out
contextual information by presenting subjects with stimuli that are context neutral
or ambiguous. Gao (2002: chapter 4.3) reported that for Standard Mandarin
speakers, the correct identification of the four tones when whispered in isola-
tion or at the sentence-​final position was above chance level (the experiment
involved ten subjects), but below chance level when the stimuli occurred in
sentence-​initial or sentence-​medial positions. This finding shows that tones
can be identified even during whispering. Presumably in sentence-​initial or
medial positions, anticipatory coarticulation that results in premature or

144 Lian-Hee Wee

delayed peaks, and so on might have a negative impact on accurate identi-
fication. Gao’s study is very comprehensive, providing even laryngoscopic
images, but what concerns us here is that the tones when whispered are in
fact cued also by length and intensity. This is corroborated by Fu and Zeng’s
(2000: 46) comparison of F0 and amplitude profiles of male and female
speakers of Standard Chinese. Together with length measurements, one is left
without doubt that pitch is not the only acoustic cue. Intensity and length do
aid speakers in discerning tones, even in whispers.
At this point, we are compelled to at least acknowledge that acoustically,
both tone and stress have all of the LIP correlates, and that it would be sim-
plistic to think that the difference between tone and stress is predicated on the
usage of pitch.

5.2.2 Lexical stress and word minimality

At the level of phonology, words can be lexically specified for tone and simi-
larly for the locus of stress, as in (3).

(3) a. Lexically specified tone (Igala, see Welmers 1973: 116)

áwó ‘guinea fowl’ vs. àwó ‘a slap’
b. Lexically specified stress (English)
baNAna vs. Canada (the former not being derivable by the stress rules)

One prominent property of stress is how stress is indicative of prosodic
headship, which in turn relates to the satisfaction of foot requirements for
minimal prosodic wordhood. For example, in English, [sii] ‘sea’ and [tii] ‘tea’
are words but not *[si] or *[ti], though [siti] ‘city’ is perfectly fine. The reason
is that in [sii] and [tii] stress can be assigned to the penultimate mora to form
a foot, there being two morae in both cases. Both *[si] and *[ti] are too short,
but their concatenation would allow for stress to be assigned to the first syl-
lable, thus forming the requisite foot.
In this aspect, tone is similar to stress. In various Chinese languages that
have toneless (neutral-​toned) syllables, these toneless syllables are not accept-
able as prosodic words. For example, the Putonghua syllable men with a
rising tone is a real word meaning ‘door’; without tone, it is a plural suffix. In
Putonghua, neutral-​toned syllables do not have a stable tone value, but will
surface as a high or low tone depending on the tone of the preceding syllable.5
The parallels between stress and tone also provide a grip on why stress is
so elusive in Chinese languages. More interestingly, Chinese speakers seem to
perceive stress a bit better at the post-​lexical/​phrasal level. Duanmu (2000/​
2007: ­chapter 6), for instance, appeals to post-​lexical contrastive stress and
phrasal stress in Standard Chinese as a starting point to make inferences on
the elusive lexical stress. This leaves us with an awkward situation where stress
is attested in Chinese languages, only to be elusive at the lexical level. Yet, at

A prosodic essence conjecture 145

the lexical level, tone is readily perceived. This makes tone and stress reminis-
cent of Janus’s two faces.

5.2.3 Historical phonology

Adding to the close-​knit relation between tone and stress is McCawley’s (1978)
treatment of how three different strains of Japanese might have been related
via a proto-​Japanese that would have been tonal. His analysis is summarized
in (4).

(4) a. Standard Japanese6 (from McCawley 1978)

nága ‘vegetable’    káki ga ‘oyster’       mákura ga ‘pillow’
nagá ‘name’         kakí ga ‘fence’        kokóro ga ‘heart’
                    kakí gá ‘persimmon’    atámá ga ‘head’
                                           sakáná gá ‘fish’
[Autosegmental association lines and L/H tone labels of the original diagram are not reproduced.]

i   Two tones: H(igh) and L(ow)
ii  H (accent) is assigned to word
iii 1st syllable (σ) is L except when marked for H
iv  All σs preceding assigned H are H
v   All σs after assigned H are L

b Central Honshu
i Requires stipulation if initial part of word is H or L
ii Proto-​Japanese is more tonal if this is taken in comparison with
Standard Japanese.

c Kagoshima (Kyushu)
Word level tone, like Mende
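
The statements in (4a.i-v) can be read as a small procedure for deriving surface pitch from the position of the assigned H. The following Python sketch is offered only as an illustration of that reading, not as McCawley's own formalization. The function name `surface_tones` is hypothetical, syllables are counted to include the particle ga, and the handling of unaccented phrases (treating them as if the H sat just past the last syllable) is an assumption that (4a) does not state.

```python
def surface_tones(n_syllables, accent=None):
    """Return one 'H' or 'L' per syllable, following (4a.i-v).

    n_syllables: number of syllables in the phrase (particle included).
    accent: 1-based index of the syllable assigned H, or None if the
            phrase is unaccented.
    ASSUMPTION: an unaccented phrase is treated as if the H were
    assigned just past its final syllable.
    """
    if n_syllables < 1:
        raise ValueError("a phrase needs at least one syllable")
    target = accent if accent is not None else n_syllables + 1
    tones = []
    for i in range(1, n_syllables + 1):
        if i == target:
            tones.append("H")   # (ii) the assigned H
        elif i == 1:
            tones.append("L")   # (iii) first syllable is L unless marked
        elif i < target:
            tones.append("H")   # (iv) syllables before the assigned H
        else:
            tones.append("L")   # (v) syllables after the assigned H
    return tones


if __name__ == "__main__":
    # Forms from (4a); syllable counts include the particle ga.
    for form, n, acc in [("mákura ga", 4, 1), ("kokóro ga", 4, 2),
                         ("atámá ga", 4, 3), ("sakáná gá", 4, None)]:
        print(form, "".join(surface_tones(n, acc)))
```

Under these assumptions the H-toned syllables returned for each form coincide with the acute accents written in (4a).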

The main idea in (4a) is that Standard Japanese appears to assign H as an

accent on a particular syllable, perhaps analogous to stress assignment. In
Japanese, however, H may spread regressively, a feat not observed for stress.

Examples (4b) and (4c) are languages that show a clear relation to (4a), which
McCawley proposes as reflective of an evolutionary path such as (5).

(5) Possible path of evolution to modern Japanese

Proto-Japanese (tone)

Honshu Kagoshima
(tone on initial and end of word) (tone at word level)

Also, citing Kikuyu (like Chinese), Tonga (like Honshu), and Ganda (like Std
Japanese), McCawley (1978) suggests the possibility that tone can evolve into
pitch accent, as evidenced by Bantu and Japanese.
As it turns out, there is evidence that stress can evolve into tone too. Wee and
Cheung (2015) report on Hong Kong English as being one such case. They
studied the transliterations of a nineteenth-century Cantonese-English
bilingual dictionary and found that stressed English syllables typically
received a transliteration that had a higher (not necessarily high) tone than the
neighboring syllables. In modern Hong Kong English (which can be historically
traced to the early Cantonese-English contact documented by these early
Cantonese-English dictionaries), by comparison, these syllables receive a stable
high tone (not just higher). Now, in Hong Kong English, there are minimal tone
pairs such as cán ‘metal container’ and càn ‘canteen’. Juxtaposing these facts,
we are confronted with the possibility that the tonality of Hong Kong English
could have evolved out of a stressed language. A mitigating admission must
be made, however, that the Anglo-​Canto contact did include a tone language
like Cantonese, so it may be that tone did not arise from a stress language but
in one. Nonetheless, Kingston did argue that tones in Scandinavian languages
(e.g., the distinction between Accent 1 and Accent 2 in Swedish, Danish, and
Norwegian) and in Central and Low Franconian dialects have evolved from
stress. Tonogenesis triggered by prosodic prominence is thus not unique to
Hong Kong English, and corroboration can be found in other languages.
Thus, from a diachronic perspective, there is also historical phonological
evidence that suggests the very intimate relationship between tone and stress.

5.2.4 Prosody

Stress is normally associated with prosody and may be assigned to feet. Most
conceptions regarding tones are that tones are associated with Tone Bearing
Units (TBUs), and current wisdom favors the mora, as evidenced in Thai (Morén
and Zsiga 2006) and Chinese (Wang 2002). However, it might well be that tones
are really associated with the foot, as can be seen in Kera (Pearce 2013: 141).

(6) Tone patterns and foot structure in Kera trisyllabic words

              (ws)s                        (s)(ws)
/L/    LLL    (dɨ̀mɨ̀ɨ̀)mɨ̀ ‘clothes’          (bɔ̀m)(bòrɔ̀ŋ) ‘carp’
/H/    HHH    (kǝ́kám)ná ‘chiefs’           (kúŋ)(kúrúŋ) ‘skin’
/M/    MMM    (celɛɛ)rɛ ‘commerce’         (kaŋ)(kǝlaŋ) ‘hat’
/LH/   LLH    (gǝ̀dàà)mɔ́ ‘horse’            *
       LHH    *                            (dàk)(tǝ́láw) ‘type of
/HL/   HHL    (kǝ́sáá)bɔ̀ ‘cricket’          *
       HLL    *                            (mán)(dǝ̀hàŋ) ‘bag’
/MH/   MMH    (tɨlɨŋ)kɨ́ ‘hole’             *
       MHH    *                            (taa)(mǝ́káá) ‘sheep’
/HM/   HHM    (kúɓúr)si ‘burning coal’     *
       HMM    *                            (sáá)(tǝraw) ‘cat’

In Kera, foot structure is iambic and determined by vowel length; hence,
there is independent evidence for the foot types presented in (6). The data
above show that tone is assigned to the foot because syllables in the same foot
receive the same L or H tone. HL or LH sequences occur only across the foot
boundary.7 Kera provides evidence that tone, like stress, also works at the level
of the foot, and one might even venture a treatment in terms of foot structures for
cases where the TBU is the mora.
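
The generalization drawn from (6), that syllables sharing a foot share a tone, can be stated as a simple check. The sketch below is only an illustration under stated assumptions: the bracketed tone strings such as "(LL)H" are a notation invented here (parentheses mark feet, a bare letter is an unfooted syllable), and the function names `parse_feet` and `foot_tones_uniform` are hypothetical.

```python
import re


def parse_feet(pattern):
    """Split a tone string like '(LL)H' or '(L)(HH)' into groups.

    Each parenthesized stretch is one foot; a bare tone letter is an
    unfooted syllable and forms a group of its own. This notation is
    an assumption made for this sketch.
    """
    groups = re.findall(r"\(([LMH]+)\)|([LMH])", pattern)
    return [list(footed or bare) for footed, bare in groups]


def foot_tones_uniform(pattern):
    """True if every foot in the pattern carries a single tone."""
    return all(len(set(foot)) == 1 for foot in parse_feet(pattern))


if __name__ == "__main__":
    # Attested surface patterns from (6), bracketed by foot.
    for attested in ["(LL)H", "(L)(HH)", "(HH)L", "(H)(LL)", "(MM)H", "(H)(MM)"]:
        print(attested, foot_tones_uniform(attested))
    # Starred patterns from (6): the tone change would fall inside a foot.
    for starred in ["(LH)H", "(L)(LH)"]:
        print(starred, foot_tones_uniform(starred))
```

The check covers only the trisyllabic patterns tabulated in (6).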

5.3 Stress and tone: Not so similar

As alluded to in the preceding paragraphs, there is a general perception that
tone and stress are different, though one now has plenty of reasons to believe
they are closely related. This section presents some aspects where tone and
stress are arguably different.
First, there are different types of tones but different degrees of stress (Hayes
1995: 25). Tones may be contoured or flat, and may have various registers.
Depending on the theory to which one subscribes, tone may even have
internal structures. That there are different types of tone makes it possible for
sandhi, since one will need at least an inventory of two tones for alternation to
happen regardless of the nature of the trigger. Stress, in comparison, does not
have such varieties, and typically stress alternates with stresslessness, some-
times accompanied by alternation of vowel type, combination, or length, as

in English demon/​demonic where placement of stress co-​varies with the vowel
quality of the stressed or unstressed syllable.
A second difference between tone and stress is that tone can spread or be
multiply assigned to a given syllable, whereas stress cannot. Stress tends to be
phonetically confined to a syllable, even if intended to indicate emphasis for
something as large as a phrase. In Kukuya, for example, a word may have one of
five tone types: H, L, HL, LH, or LHL. These tones would be associated to
the syllables of a word from left to right, exemplified below in (7). In (7a-​c)
one sees clear cases of tone spreading, and in (7c-​e), one sees instances where
a syllable receives multiple tones. Such patterns are not attested for stresses.

(7) Tone association in Kukuya (Hyman 1987)

   Tone  monosyllabic                    disyllabic                  trisyllabic
a. H     (mà).bá ‘oil palms’             (mà).bágá ‘show knives’     (kì).bálágá ‘fence’
b. L     (kì).bà ‘grasshopper killer’    (kì).bàlà ‘to build’        (kì).bàlàgà ‘to change route’
c. HL    (kì).kâ ‘to pick’               (kì).kárà ‘paralytic’       (kì).káràgà ‘not be entangled’
d. LH    (mù).sǎ ‘weaving knot’          (mù).sàmí ‘conversation’    .mwàrə̀gí ‘younger brother’
e. LHL   (.ndέ)bvi᷈ ‘he falls’            (.ndέ)pàlî ‘he goes out’    (.ndέ)kàlə́gì ‘he turns around’

Despite these differences, one is unable to define languages as tonal or not
based on them. This is because there are languages that utilize tones without
any sandhi or associative processes. Wenbao Hakka, for example, has a
tonal inventory of seven: [33, 22, 42, 31, 13, 35, 53] and 49 (= 7x7) disyl-
labic tonal combinations as none of the collocations trigger alternation (Wee
2019: ­chapter 5).
Harking back for the moment to Yip’s (2002: 1) definition in (1), a popular
distinction of tone from stress is lexical contrastiveness or the availability of
minimal pairs. As pointed out in Section 5.2.2, this is a non-​starter, given that
stress can also be lexically specified. Further, there are cases of languages that
one might probably consider tonal even with the absence of attested min-
imal pairs. In Singapore English, minimal pairs are nearly impossible to find
except for sentence-​final particles (examples of which may be found in Lim
2011a, b). Yet, Singapore English arguably employs tones (Wee 2008b; Ng
2011). Recall also grammatical tones as seen in Somali (Section 5.1). In any
case, if the lexical/​phrasal distinction is what is crucial (word and lexical being
two key words in (1)), then what we mean by a “tone language” is more a
lexical-​phrasal distinction than a tonal/​atonal distinction. Moreover, there is
the issue of categorizing Japanese which has minimal pairs like sake where

high tone in the first syllable means ‘salmon’ and on the second means ‘wine’
(see Section 5.2.3 above).
To be sure, consider a list of characteristics that distinguish tone and stress
provided in Hyman (2009: (3)), presented in (8).

(8) Comparing stress and tone (Hyman 2009: (3))

         form        function     system        P-bearer  level    domain
stress   structural  contrastive  syntagmatic   syllable  lexical  output
tone     featural    distinctive  paradigmatic  mora      URs      input

In (8), there seems to be nothing in common between stress and tone,
contrary to the discussion in Section 5.2. However, if we check each item,
we soon find that some distinctions are not so robust. Starting with FORM,
the issue here is one of “culminativity”, so there can only be one primary
stress, presumably calculated based on the prosodic structure of a given
word. If one accepts this, it would exclude languages where the concaten-
ation of syllables results only in the preservation of the tone of a particular
syllable, Shanghai being the case in point. As for tone being featural,8 one
wonders if this might be due to how linguists have rarely expressed the phon-
etic property of stressed syllables. It is conceivable that so-​called stressed
syllables carry a feature [+high] or [+loud] etc. FUNCTION was addressed in the
opening paragraphs of this section, noting that stress is contrastive. As shall
be argued in Section 5.4, this can be captured in a model that does not dis-
tinguish between tone and stress, but between certain properties that relate
to the use of “contours”. SYSTEM relates closely to the FORM aspect, and one
can again cite languages like Shanghai and Hangzhou where a given string
of syllables can be distinguished by which syllable is the source that provided
the tone for the overall word, hence syntagmatic. Similarly, neutral-​tone
syllables in Standard Chinese occur only with fully toned ones: in such words,
one has a tone/​toneless contrast that has often been treated as indicative of
stress/​unstressed. Further the fact that in certain “tone” languages, toneless
syllables are not acceptable prosodic words suggests also the “obligatoriness”
of tones.9 P-​BEARER is rather murky since it is possible to argue that stress falls
on the mora in a mora-​counting system. Even if one claims that the mora
is the tone-​bearing unit, it is on the syllables that tones are contoured or
level. With LEVEL, we have seen cases where stress must be stipulated, hence
possibly also applying to the underlying representations (URs). Conversely,
Singapore English and Hong Kong English use pitch so systematically that
one might argue for their tonal status. Yet tone for many of their syllables
can be derived the same way one calculates lexical stress for languages like
Plains Cree (where stress is derived by counting the syllables; Wolfart 1996;

Okimasis and Ratt 1999; Goedemans and van der Hulst 2013). Finally, there
is DOMAIN. As noted earlier, stress can be part of the underlying form, and it
is equally possible that tone can be assigned at the output. Stress and tones
can both be assigned to word-​or phrase-​levels as well.
Taken all together, one must recognize that there are differences but also
similarities between stress and tone. Our current understanding of tone and
stress bifurcates the two so that any similarities between them are coincidental.
However, as Section 5.2 demonstrates, such a position is untenable even in the
face of instances demonstrated in Section 5.3 where tone and stress may have
their differences. The challenge is therefore to find a conception of prosody
that would unify tone and stress.

5.4 Prosodic essence

A viable theory of prosody must be one that captures both the intricate
similarities and the differences between tone and stress. This, I believe, would
involve a paradigm shift
that cannot be accomplished with a paper, a book, or the influence of a single
person. I offer here a conjecture that might offer some hope, or failing that,
then perhaps some direction.

5.4.1 The conjecture

Suppose one boldly assumes that stress and tone, both related to LIP, are
really manifestations of the same underlying abstract Prosodic Essence.
Taking a step back, one can recast one’s understanding of tone as (9) and
stress as (10).

(9) “Tone” is where pitch is used consistently for indicating some kind of
prosodic contrast (lexical or postlexical) under normal phonation.

The set of prosodic contrasts would include what is a prosodic head and
what are the dependents. For example, the H-​accented syllable in Japanese is
presumably the prosodic head that then spreads its tone to the dependents.
At the word level, such prosodic contrasts would take the form of different
phonological words, for example, má ‘hemp’ and mà ‘to scold’ in Standard
Chinese.10 Tone could also be used to distinguish one construction from
another, for example, lexicalized form in Cantonese wong4 ‘yellow’ versus
wong2 ‘egg yolk’, and to indicate intonation types such as interrogatives and

(10) “Stress” is where a difference in Length, Intensity, or Pitch is used for
indication of prosodic contrast (lexical or postlexical).

The statement in (10) hardly requires elaboration. For example, in baNAna,
NA can be longer, louder, and/or higher than the other syllables and will be

interpreted as stress. The main difference between (9) and (10) is in whether
or not “pitch” is consistently used for indicating some kind of prosodic con-
trast. However, “pitch” here requires some elaboration, for it certainly is not
available during whispers. Going back to the discussion in Section 5.2.1,
another angle for looking at the phonetic correlates of tone is through the
assumption that pitch is primary, with length and intensity cues as secondary
correlates. Thus under normal phonation, thinking of tone as pitch appears
to be the most straightforward approach, but in whispers, the pitch infor-
mation11 would need to be reconstructed by the hearer using the length and
intensity cues.
As mentioned in Section 5.3, an aspect related to the distinctions regarding
tone and stress is that tone can be level or contoured, although the tone may
not necessarily be on a particular syllable but can be associated to a bigger
domain such as words. However, not all languages have contoured tones.
Following these ideas of “pitch” and “contour”, a prosodic essence conjecture
may be formulated as in (11), which I shall later show might be potentially
useful in resolving the conundrum laid out in Section 5.2 and Section 5.3.

(11) A Prosodic Essence Conjecture

Languages differ in specific ways in how prosodic essence is manifest
depending on whether a language is [contour] and/​or [pitch].
Prosodic Essence (PE): The underlying prosodic contrasts employed
by a language. PEs are manifest phonetically
as Length, Intensity, and/​or Pitch (LIP).
[pitch]: A language is [pitch] if F0 is the primary
manifestation of the prosodic essence.
[contour]: A language is prosodically [contour] if that
language contrasts between level or contoured
manifestations of a prosodic essence X, where
X surfaces phonetically as LIP.

[pitch] and [contour] are parameters for how a language may express
its prosodic patterns. The definition of [pitch] allows us to cover cases of
whispering because in whispering, the intensity and length cues are intended
for the reconstruction of the F0 that is suppressed. Thus, regardless of how
prosodic contrasts are expressed in a language, phonetic Length, Intensity,
or Pitch (LIP) would all be utilized for expression of the language’s prosodic
“essence”. If modulation of LIP is involved, then that language would also be
specified for [contour].
The Prosodic Essence Conjecture in (11) promises a way of typologizing
languages in terms of how prosody might be expressed. Through the recognition
of [pitch] but without confining it to the word level, (11) defuses the
intractable problem of defining tone. To capture the FUNCTION differences
of “contrastive” and “distinctive”, the conjecture draws upon [contour],
since a language with it shall then be able to express prosodic essences

that are distinctive, such as tone, and those without it shall be confined to
using whatever prosodic feature “contrastively” (in the sense of absence or
presence of phonetic cues for prosodic prominence). Thus, (11) allows for
the unification of tone and stress through the idea that both are expressions
of prosodic essence. More interestingly, it opens up a typology that would
allow for the capture of what has been traditionally described as pitch
accents without resorting to a tone-stress continuum and also without
requiring a tone-​stress dichotomy as advocated in Hyman (2006, 2009).

5.4.2 Predictions and typology

With two parametric features [pitch] and [contour], one would logically
expect a combination that yields four language types, outlined in (12) below,
all of which, I believe, are attested.

(12) Typology
                [contour]: Yes       [contour]: No
[pitch]: Yes    Standard Chinese     Japanese
[pitch]: No     English              Hawaiian
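
The two parameters in (12) can also be written out as a small lookup, which makes the four predicted types explicit. This is only a sketch: the class name `ProsodicProfile`, the table `EXAMPLES`, and the function `describe` are hypothetical names introduced here, and the example languages are simply those named in the discussion of (12) in this section.

```python
from typing import NamedTuple


class ProsodicProfile(NamedTuple):
    """Settings of the two parameters proposed in (11)."""
    pitch: bool
    contour: bool


# Example languages as suggested in this section; the groupings are the
# chapter's tentative proposals, not established classifications.
EXAMPLES = {
    ProsodicProfile(pitch=True, contour=True):
        ["Standard Chinese", "Cantonese", "Yoruba", "Mende"],
    ProsodicProfile(pitch=True, contour=False):
        ["Japanese", "Norwegian", "Hong Kong English", "Singapore English", "Kerewe"],
    ProsodicProfile(pitch=False, contour=True):
        ["English"],
    ProsodicProfile(pitch=False, contour=False):
        ["Hawaiian", "Plains Cree"],
}


def describe(profile):
    """Render a profile in the bracket notation used in the text."""
    active = [name for name, on in (("contour", profile.contour),
                                    ("pitch", profile.pitch)) if on]
    return "[" + ", ".join(active) + "]" if active else "neither parameter"


if __name__ == "__main__":
    for profile, languages in EXAMPLES.items():
        print(describe(profile), "->", ", ".join(languages))
```

Because the parameters are binary, the lookup has exactly four keys, which mirrors the claim that the typology provides for four types rather than a continuum.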

[contour, pitch] languages would have contours that cannot be derived
and would use pitch as a primary means of indicating prosodic contrasts.
Standard Chinese would fit this category, as would many other Chinese
languages such as Min, Hakka, Cantonese, and also African languages such
as Yoruba and Mende. The difference between Chinese and the African
languages would be attributable to whether or not the contours are lexically
specified at the word level or the syllable level. These are the languages that
have been traditionally considered tonal.
[pitch] languages (without [contour]) would employ pitch as a primary
means of indicating prosodic contrasts, for example, Japanese,
Norwegian, Hong Kong English, Singapore English, Kerewe, and perhaps
French. Because there is no [contour], only H and L tones would be
involved, and all contour tones would be derived. In Hong Kong English,
for example, falling tones are unattested except as a result of concatenation
between the high tone H and a boundary tone L% (Cheung 2009;
also Wee 2015). These are the languages that appear to border on being
tonal and have been the subject of much discussion as to their prosodic
nature. Through recognition of [pitch] and the absence of [contour]
as the expression of prosodic essence, their ambivalent status can be quite
easily captured.

[contour] languages (without [pitch]) would interpret prosodic contrasts
without necessary reference to pitch differences (thus, LIP enhancement
would all be perceived as “stress”), and these languages would have contours
that cannot be derived. At first blush, it is hard to see what [contour] would
mean in the absence of [pitch]. Are length and/or intensity being modulated,
and if so how? I suspect English to be the case in point. In English, stress can
manifest as any of the LIP options. Further, one distinguishes between long
and short vowels, heavy and light syllables, and other such things that interact
with prosody to give vowel reduction and/or compensatory lengthening.
Vowel length alternation and diphthong reduction appear to me to be evidence
of [contour] as part of the English prosodic feature. Underlying
diphthongs may be reduced depending on the locus of stress. For example,
divine [ai] versus divinity [i] suggests instances of underived vowel quality/
intensity [contour] that may be neutralized by loss of prosodic prominence,
that is, when unstressed. For purposes of comparison, notice that in Standard
Chinese, where [pitch] is a parameter, there is no vowel length contrast or
heavy/light contrast independent of tone.
Finally, languages with neither parameter would interpret prosodic
contrasts without necessary reference to pitch differences. Thus, like [contour]
languages, any of the LIP enhancements would be perceived as
“stress”. Unlike [contour] languages, languages with neither parameter will
not employ rising or falling tones, long/short vowels, or heavy/light syllable
distinctions. These are languages where all syllables are (C)V or (C)V(C), and
syllables may be stressed depending on whatever relevant principles of stress
assignment apply. In other words, one would be looking for a language that
shares the same syllabic structural constraints as Japanese but without the
underlyingly specified pitch accents. Hawaiian might be an example, where
stress is predictably on the penultimate mora, easily located by counting the
number of vowels, treating vowel sequences as two separate entities. Further,
Hawaiian forbids consonantal sequences within a syllable and is not known
to have consonantal codas. Thus, unlike English, Hawaiian “diphthongs” are
arguably not part of the same syllabic nucleus, removing any possibility of
treating Hawaiian as [contour]. Tones have also not been noted to play any
role in Hawaiian phonology.12 Plains Cree, for which stress assignment is fixed
depending on the number of syllables (final syllable for disyllabic words, but
antepenult for longer strings with secondary stresses on alternate syllables
from the antepenult)13 but insensitive to vowel length, would also fall under this
category.
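
The two counting procedures just described can be sketched in a few lines. This is a rough illustration of the descriptions given in this paragraph, not a full account of either language. The function names are hypothetical; the Hawaiian function assumes a plain spelling in which every vowel letter counts as one mora (so a long vowel would have to be written as a doubled letter), and the Plains Cree function returns only the primary stress and leaves monosyllables undefined because the text does not mention them.

```python
VOWELS = set("aeiou")


def hawaiian_stress_index(word):
    """Index in `word` of the penultimate vowel letter.

    ASSUMPTION: each vowel letter is one mora, and vowel sequences are
    counted as separate entities, as described in the text.
    """
    positions = [i for i, ch in enumerate(word.lower()) if ch in VOWELS]
    if not positions:
        raise ValueError("no vowel found")
    # A word with a single vowel can only be stressed on that vowel.
    return positions[-2] if len(positions) >= 2 else positions[-1]


def plains_cree_primary_stress(n_syllables):
    """1-based position of primary stress, counted from the left.

    Final syllable for disyllabic words, antepenult for longer strings.
    Secondary stresses are not modeled here.
    """
    if n_syllables < 2:
        raise ValueError("monosyllables are not covered by the description")
    return 2 if n_syllables == 2 else n_syllables - 2


if __name__ == "__main__":
    word = "aloha"
    i = hawaiian_stress_index(word)
    print(word[:i] + word[i].upper() + word[i + 1:])
    print([plains_cree_primary_stress(n) for n in (2, 3, 4, 5)])
```

Neither function consults vowel length or pitch, which is the point of placing such languages in the cell with neither parameter.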
The typology of four, as given above, allows for a more subtle classifica-
tion of languages other than whether or not they are tonal while grounding
the stress/​tone distinction on a more explicit use of pitch and contours, thus
avoiding the difficulties laid out in Section 5.2 and Section 5.3. Somewhat
less fashionably, the Prosodic Essence Conjecture with its two parametric
features does not require a continuum between stress and tone.14 The typology
outlined in this section provides for only four types, which is a rather more

stringent view of things. I submit that this is desirable for two reasons. First,
given that both stress and tone have LIP correlates, a tone-​stress continuum
conception would raise the question as to the type of scale for the continuum,
an objection that has already been raised in Hyman (2009). Secondly, a con-
tinuum would only be motivated if one can indeed demonstrate a typology
where a set of languages can be demonstrated to differ in various degrees of
a given parameter. That does not seem to have been fulfilled by the current
observations of languages. However, this approach differs from Hyman’s in
unifying both tone and stress, hence precluding any language that has both
a tone and a stress system in parallel. In other words, the Prosodic Essence
Conjecture predicts a language that has a tonal system like Chinese together
with a stress system like English to be impossible.
There is one more merit of the Prosodic Essence Conjecture. The
parameters as given in (11) and (12) can couple with prosodic hierarchies
to produce a much more intricate typology, say association of PEs to syl-
lable (e.g., Chinese), word (e.g., Kerewe), phrase levels (e.g., boundary
tones), and so forth. This allows us to tease out the effect of “tone” or
“stress” at different prosodic levels rather than restrictively to within the
word level. Note also that the Prosodic Essence Conjecture does not pre-
clude parameters such as [culminativity] and [obligatoriness] (clev-
erly used to much success in Hyman 2006, 2009), which in combination
would also generate a more fine-​grained distinction of language types. In
any case, [culminativity] could really be a property of whether [pitch]
is underlyingly specified or is calculated by rule, making it rather an epi-
phenomenon. The distinction between underlying specification and surface
patterning would also capture the distinction between languages where
stress depends on tone (i.e., underlying), for example, Mayo and Yaqui
(Hyman 2009, and references cited therein) and those where tone depends
on stress (i.e., surface patterning), for example, Punjabi (Singh, sine anno)
and Seneca (de Lacy 2002 and Hyman 2009).

5.5 Conclusion

The stress-​tone distinction of languages is problematic when one looks into
the phonetic properties and the associated phonological patterns of stress and
tone. The two appear to be so closely related as a prosodic cue that they might
conceivably be faces of the same prosodic coin. This chapter takes the
counterintuitive approach of collapsing both tone and stress, envisioning a
two-parameter system, [pitch] and [contour], to reimagine our understanding
of prosody. Underlying this conjecture is, first, the recognition that what
we have understood as tone relies on pitch as a primary physical property,
a constraint not applicable to stress; and, secondly, that tones may contour.
The pairing of these parameters surprisingly yields a four-​way typology that
appears to be supported by actual known languages, offering a fresh perspec-
tive on how their prosodies might be understood. I hasten to add that the

unification of tone and stress under the idea of Prosodic Essence must not
be simplistically understood as the possibility of analyzing all “tone” cases as
“stress” or vice versa. The correct understanding would be that no case of “tone” or
“stress” can be analyzed without thinking of it as prosody. Nonetheless,
the Prosodic Essence Conjecture is at this stage a conjecture only, and this
chapter merely proposes that unifying tone and stress as the same prosodic
entity would produce not only a better taxonomy but also a better typology.

1 Thanks to Winnie H. Y. Cheung, Mingxing Li, and Diana Archangeli for useful
ideas and discussions. If they thought I was insane, they hid their feelings well and
tried to inoculate me with many helpful challenging questions, all of which I only
managed to fudge. Special thanks to the audience at the International Conference
of Prosodic Studies, in particular Hongming Zhang, Jianhua Hu, Shengli Feng,
Jie Zhang, and Chilin Shih, for their insights and encouragement. The research
is supported financially by GRF-HKBU250712, and in all other aspects by
friendships too numerous to list.
2 Pitch accent is not a prototype, or else we would have things like glottal accent and
so forth (Hyman 2006).
3 But see also Hyman (2007) for a different account, and Wee (2015), who argues in
support of the OCP account.
4 Duanmu added non-​head stress for compounds and phrases (p. 136). In the 2nd
edition published in 2007, Duanmu takes a more nuanced position, which still
highlights the elusiveness of stress in Standard Chinese.
5 See Wang (2002) for analysis of such neutral tones and their surface pitch values
of Beijing, Shanghai, and Urumqi, all evoking the mora.
6 Duanmu (2007) presents a two-​accent model for describing modern Japanese as
being superior to McCawley’s (1978) one-​accent model. The difference does not
affect the point made here.
7 Nonetheless, it should be noted that if the word has only one foot (i.e., disyllabic),
then each syllable can have a different tone, generating LH, HL, MH, and HM
sequences (Pearce 2013: 135–​136).
8 See also Clements, Michaud, and Patin (2010), Hyman (2010), and Odden
(2010). These works examine how tone features are fundamentally different from
other phonological features and may not even be easily reduced to their physical
9 The syntagmatic issue is certainly quite complex, since there are languages like
Thai, which does have words that are analyzed as underlyingly toneless (Morén
and Zsiga 2006 believe that mid-​tone syllables are underlyingly toneless), but these
are invariably heavy syllables, which I believe goes precisely to show the entangled
relationship between tone and prosody.
10 Phonological words are not necessarily real words, because in Standard Chinese there
are gaps involving attested syllables without all four tonal contrasts in the list of
real words, e.g., zhuo3, nu1, la2, gui2, etc.
11 Though of course pitch can be distorted by speech rate and other factors.
12 For an excellent brief, see Wikipedia’s entry on Hawaiian Phonology, accessed 4
August 2015, https://​​wiki/​Hawaiian_​phonology.

13 See description and data on Plains Cree (n.d.) in Wikipedia, accessed 18 August 2015,
from https://​​w/​index.php?title=Plains_​Cree&oldid=659671061
14 Diachronic shifts in languages appear to be a continuum, but this does not
challenge the view advocated in this chapter. The continuum observed in the dia-
chronic shift is probably the result of the shifts in individuals across the popula-
tion creating a gradual change in the ratio of speakers of one stage of grammar to
another, so that the two states of grammars may remain categorically distinct.

Banti, G. (1988) “Two Cushitic systems: Somali and Oromo nouns” in van der
Hulst, H., and Smith, N. (eds.) Autosegmental studies in pitch accent systems.
Dordrecht: Foris, pp. 11–​49.
Bradshaw, M. (1998) “One-step tone raising in Ali”, OSU Working Papers in
Linguistics, 51, pp. 1–17.
Chao, Y.-R. (1968) A grammar of spoken Chinese. Berkeley: University of
California Press.
Cheung, W. H. Y. (2009) “Span of high tones in Hong Kong English” in Kwon, I.,
Pritchett, H., and Spence, J., (eds.) Proceedings of the 35th Annual Meeting of
Berkeley Linguistics Society (BLS 35), Berkeley, CA, pp. 72–​82.
Clements, G. N., Michaud, A., and Patin, C. (2010) “Do we need tone features?”
in Goldsmith, J. A., Hume, E., and Wetzels, L. (eds.) (2010) Tones and
features: Phonetic and phonological perspectives. Berlin: Mouton De Gruyter,
pp. 3–​24.
de Lacy, P. (2002) “The interaction of tone and stress in optimality theory”, Phonology,
19(1), pp. 1–​32.
Duanmu, S. (2000/2007) The phonology of Standard Chinese. Oxford: Oxford
University Press.
Duanmu, S. (2002) “Tone and non-​tone languages”. Paper presented at the Eighth
International Symposium on Chinese Languages and Chinese Linguistics, 8–​10
November 2002, Academia Sinica, Taipei.
Duanmu, S. (2007) “A two-​accent model of Japanese word prosody”, Toronto Working
Papers in Linguistics, 28, pp. 29–​48.
Feng, S.-​L. (2002) The prosodic syntax of Chinese. Munich: Lincom Europa.
Fu, Q.-​J., and Zeng, F.-​G. (2000) “Identification of temporal envelope cues in Chinese
tone recognition”, Asia Pacific Journal of Speech, Language and Hearing, 5(1),
pp. 45–​57.
Gao, M. (2002) Tones in whispered Chinese: Articulatory features and perceptual
cues. M.A. thesis, University of Victoria.
Gao, M.-​K., and Shi, A.-​S. (1963) Yuyanxue Gailun [Introduction to linguistics].
Beijing: Zhonghua Shuju.
Goedemans, R., and van der Hulst, H. (2013) “Weight-​sensitive stress” in Dryer,
M. S., and Haspelmath, M. (eds.) The world atlas of language structures online.
Leipzig: Max Planck Institute for Evolutionary Anthropology. Available online at
http://​​chapter/​15, accessed 17 August 2015.
Goldsmith, J. A., Hume, E., and Wetzels, L. (eds.) (2010) Tones and features: Phonetic
and phonological perspectives. Berlin: Mouton De Gruyter.
Hayes, B. (1995) The metrical theory of stress: Principles and case studies.
Chicago: University of Chicago Press.

Hoffmann, C. (1963) A grammar of the Margi language. London: Oxford
University Press.
Hyman, L. M. (1987) “Prosodic domains in Kukuya”, Natural Language and Linguistic
Theory, 5(3), pp. 311–​333.
Hyman, L. M. (2001) “Tone systems” in Haspelmath, M., König, E., Oesterreicher, W.,
and Raible, W. (eds.) Language typology and language universals: An international
handbook, vol. 2. Berlin; New York: Walter de Gruyter, pp. 1367–​1380.
Hyman, L. M. (2006) “Word-​prosodic typology”, Phonology, 23(2), pp. 225–​257.
Hyman, L. M. (2007) “Universals of tone rules: 30 years later” in Riad, T., and
Gussenhoven, C. (eds.) Tones and tunes: Studies in word and sentence prosody.
Berlin: Mouton de Gruyter, pp. 1–​34.
Hyman, L. M. (2009) “How (not) to do phonological typology: The case of pitch
accent”, Language Sciences, 31(2/​3), pp. 213–​238.
Hyman, L. M. (2010) “Do tones have features?” in Goldsmith, Hume, and Wetzels
(eds.), pp. 50–​80.
Hyman, L. M., and Leben, W. R. (forthcoming) “Word prosody II: Tone systems” in
Gussenhoven, C., and Chen, A. (eds.) Handbook of prosody. (Manuscript version
July 2017).
Kingston, J. (2011) “Tonogenesis” in van Oostendorp, M., Ewen, C. J., Hume, E.,
and Rice, K. (eds.) The Blackwell companion to phonology. Oxford: Blackwell, pp.
Leben, W. (1973) Suprasegmental phonology. Ph.D. Diss., Massachusetts Institute of
Technology. Distributed by Indiana University Linguistics Club.
Leben, W. R. (forthcoming) “Tone and length in Mende.” To appear in UCLA Working
Papers in Phonetics (Memorial volume for Russ Schuh).
Lim, L. (2011a) “Tone in Singlish: Substrate features from Sinitic and Malay” in
Lefebvre, C. (ed.) Creoles, their substrates and language typology. Typological
Studies in Language 95. Amsterdam; Philadelphia: John Benjamins, pp. 271–​287.
Lim, L. (2011b) “Revisiting English prosody: (Some) New Englishes as tone
languages?” in Lim, L., and Gisborne, N. (eds.) The typology of Asian Englishes.
Benjamins Current Topics 33. Amsterdam; Philadelphia: John Benjamins, pp.
McCawley, J. D. (1978) “What is a tone language?” in Fromkin, V. (ed.) Tone: A lin-
guistic survey. New York: Academic Press, pp. 113–​131.
Morén, B., and Zsiga, E. (2006) “The lexical and post-​lexical phonology of Thai
tones”, Natural Language and Linguistic Theory, 24(2), pp. 113–​178.
Ng, E-​Ching. (2011) “Reconciling stress and tone in Singaporean English” in Zhang,
L. J., Rubdy, R., and Alsagoff, L. (eds.) Asian Englishes: Changing perspectives in a
globalised world. Singapore: Pearson Longman, pp. 48–​59.
Odden, D. (1995) “Tone: African languages” in Goldsmith, J. (ed.) The handbook of
phonological theory. Oxford: Blackwell. doi: 10.1111/​b.9780631201267.1996.00014.x
Odden, D. (2010) “Features impinging on tone” in Goldsmith, Hume, and Wetzels
(eds.), pp. 81–​107.
Odden, D. (forthcoming) “Tone in African languages” in Vossen, R. (ed.) Handbook of
African languages. Oxford: Oxford University Press. Available online at www.ling.
ohio-​​~odden/​AfricanTone.pdf, accessed 22 November 2017.
Okimasis, J. L., and Ratt, S. (1999) Cree: Language of the plains /​nehiyawewin: paskwawi-​
pikiskwewin. Regina: Canadian Plains Research Center, University of Regina.
Pearce, M. D. (2013) The interaction of tone with voicing and foot structure. Stanford,

Pulleyblank, D. (1983) Tone in lexical phonology. Ph.D. Diss., Massachusetts Institute
of Technology. (Published 1986, Dordrecht: D. Reidel.)
Singh, Sukhvinder S. (n.d.) “Tone rules and tone sandhi in Punjabi”. MS Thesis, Guru
Nanak De University, India. Available online at​10225279/​
Tone_​Rules_​and_​Tone_​Sandhi_​in_​Punjabi, accessed 22 November 2017.
Wang, J.-​L. (2002) “An OT analysis of neutral tone in three Chinese dialects”, Yuyan
Kexue [Linguistic sciences], 1(1), pp. 78–​85.
Wee, L.-​H. (2008a) “Opacity from constituency”, Language and Linguistics, 9(1), pp.
Wee, L.-​H. (2008b) “Phonological patterns in the Englishes of Singapore and Hong
Kong”, World Englishes, 27(3/​4), pp. 480–​501.
Wee, L.-​H. (2015) “Prominence from complexity: Capturing the Tianjin ditonal
patterns”, Language and Linguistics, special issue on Theoretical Aspects of
Chinese Phonology, 16(6), pp. 891–​926.
Wee, L.-​H. (2016) “Tone assignment in Hong Kong English”, Language, 92(2), pp.
Wee, L.-​H. (2019) Phonological tone. Cambridge: Cambridge University Press.
Wee, L.-​H., and Cheung, W. H. Y. (2015) “The Chinese-​English Instructor’s lesson for
Hong Kong English” in Hsiao, Y.-​C. E., and Wee, L.-​H. (eds.) Capturing phono-
logical shades within and across languages. Newcastle upon Tyne: Cambridge
Scholars, pp. 342–​388.
Welmers, W. E. (1973) African language structures. Berkeley: University of
California Press.
Wolfart, H. C. (1996) “Sketch of Cree: An Algonquian language” in Goddard, I. (ed.)
Handbook of North American Indians, vol. 17: Languages. Washington, DC: Smithsonian
Institution, pp. 390–439.
Yip, M. (1989) “Contour tones”, Phonology, 6(1), pp. 149–​174.
Yip, M. (2002) Tone. Cambridge: Cambridge University Press.

Phonological representations based on statistical modeling in tonal languages
Si Chen

6.1 Introduction

6.1.1 Phonetic and phonological representations and their integration

Chen et al. (2017) summarize two approaches to the relationship between
phonetics and phonology. The first is a unified approach, which treats
phonetics and phonology as one (Flemming 2001;
Steriade 2000). The second approach makes the assumption that the two
subfields are distinct. It is claimed that phonetics studies phenomena that are
more continuous and gradient, whereas phonology studies more discrete and
categorical phenomena, and the two subfields can also show some integra-
tion (Chomsky and Halle 1968; Cohn 1990, 2007; Keating 1996; Keyser and
Stevens 2001; Kingston 2007; Hyman 2013: 4). This study adopts the second
approach, and argues that we can integrate phonetics and phonology using
statistical modeling. Specifically, it shows how phonological representations
of tones can be established through statistical modeling of pitch contours,
using data from Chongming Chinese tones. Section 6.1.2 offers some
background on Chongming Chinese.

6.1.2 Background on Chongming Chinese

Chongming Chinese is a northern Wu dialect, also called Haimen, Qidong, or
Qihai Chinese. It is mainly spoken in Chongming County in eastern China,
and other areas such as Haimen, Qidong city, Shazhou, Nanhui, Fengxian,
and Chuansha (Zhang 2009). The place of investigation for this study is the
city of Qidong, and the subjects recruited were around 45 years old. Zhang
(2009) used Chao’s (1930) five-​point scale, in which five stands for the highest,
and one stands for the lowest, to describe the tones. The eight tones are
described in Table 6.1, using Chao’s scale.
Table 6.1 Eight tones in Chongming Chinese

Middle Chinese categories       Ping      Shang      Qu           Ru
                                (Level)   (Rising)   (Departing)  (Entering)
                                Even      Oblique    Oblique      Even
Chongming    High register      1         3          5            7
allotones                       53        435/424    33           55/5
             Low register       2         4          6            8
                                24        241/242    213/313      23/2

In Tones 1, 3, 5, and 7, only a voiceless aspirated or unaspirated obstruent
can occur at the onset, while in Tones 2, 4, 6, and 8, only a voiced obstruent
can occur due to tone split (Zhang 2009). It is thus intriguing to first examine
whether F0 contours are sufficient to distinguish these allotone pairs when
onset consonants are not present. Also, Chen and Zhang (1997) state that
there is a glottal stop marked by the symbol “ʔ” in Tone 7 and Tone 8, which
reflects the Middle Chinese stop endings /p/, /t/, /k/. However, a phonetic
examination shows that for every speaker, this glottal stop occurs in only
a certain proportion of tokens, and the proportion varies from speaker to
speaker, as reported in Section 6.2. Moreover, Zhang (2009) also reports that
Tones 7 and 8 are short tones, and the duration of each tone is also examined
with statistical analysis in Section 6.2. The phonetic cues examined were then
investigated in a perceptual experiment.

6.1.3 Phonetic cues contributing to tone perception

Before proposing a phonological model, this study ran a perceptual test with
native speakers. The test examined whether F0 contours alone can contrast
allotone pairs with voiced versus voiceless onsets when the onset consonants
are absent, and whether other phonetic cues, such as the duration and
glottalization reported in the fieldwork and found in the speech production
data of this study, still contribute significantly to discrimination in the
presence of F0 contours.
Phonetic cues that may contribute to tone perception described in the lit-
erature include F0, amplitude, and voice quality. Gandour (1978) points out
that the primary correlate of pitch is fundamental frequency (F0), which is
an important feature of tones. Gandour (1978) cites a study by Abramson
(1975: 5), who superimposed averaged F0 contours, and found that subjects
achieved correct identification of tones about 93 percent of the time. Gandour
(1978) notes that F0 is the primary cue in the identification of Thai tones,
with amplitude contours as a secondary cue, referring to Abramson’s
(1975: 5–8) results showing that subjects increased the correct identification
rate by 3 percent after amplitude contours were added. However, amplitude
contours alone are far from sufficient, since Abramson (1972: 36–37) shows
that listeners cannot discriminate between tones based on amplitude contours
alone. Amplitude contours may nevertheless be more effective in distinguishing
tones in other languages, as illustrated below.

Moreover, F0 can be correlated with duration and intensity. Zhang
(2001) proposes a tendency for vowels with longer sonorous duration to
carry contours. Kong (1987) also reports duration as a correlate of tones
besides F0, amplitude, and phonation types. In the six tones of Cantonese,
the longest tone is the high rising tone. Within level tones, the mid-​level
tone is longer than the mid-​low level, which is in turn longer than high
and low tones. The correlation between F0 and duration is not linear,
judging from the graph Kong provides. The duration is positively related
to F0 until one point, and then starts to be negatively related to F0 after
that point.
Perceptually, Blicher et al. (1990) report that longer duration will enhance
the pitch cue for the third tone in Mandarin. Whalen and Xu (1992) demon-
strate that Mandarin tones have a consistent difference in duration and amp-
litude contours. They modify the tones so that only duration and amplitude
remain, without F0 or formants, and show that every tone in Mandarin except
Tone 1 can be effectively identified by amplitude contours. F0 is correlated
with the amplitude contour. However, not every language employs amplitude
as an important cue in perception. Hombert (1976) examines perception of
tones in Yoruba, finding that tones have varied amplitude, which is not an
important cue.
Other than F0, duration, and amplitude, phonation types can also contribute
to categorizing lexical tones. Andruski (2006) compares well-identified
and poorly identified tokens, showing that well-identified tokens are associated
with clearer phonation types, namely a larger difference in H1-H2 (the amplitude
of the first harmonic minus that of the second harmonic) for breathy voice and
increased jitter and shimmer for creaky voice. Increased shimmer values
in a modal token are likely to cause identification of a creaky tone. Northern
Vietnamese dialects combine the F0 cue with phonation types, while Southern
Vietnamese is considered to be restricted in using phonation types (Brunelle
2003; Pham 2003). Brunelle (2009) uses resynthesized stimuli to control
intensity and duration to test cues used in perception among northern and
southern Vietnamese speakers. For northern speakers, pitch and voice quality
are involved in the identification of tones. Low offset, rising contours char-
acterize two big categories further split by the voice quality being modal or
mid-​laryngealized. The rest are realized as another category, which is also
split by voice quality. Moreover, the final glottalization cue is considered
more important than pitch since listeners may not detect the pitch contour
effectively with final glottalization. Southern listeners use similar cues, but
voice quality is less used because the southern dialect does not have voice
quality contrasts. Brunelle argues that southern listeners are also exposed to
the standard northern tones, and can identify salient glottalization. Local
et al. (2003) also suggest that in East and Southeast Asia, pitch and phon-
ation types commonly appear together, such as creaky voice associated with
Mandarin Chinese and Tone 3 in the Tianjin dialect (Davison 1991). Belotel-Grenie
and Grenie (1997) show that creaky voice often occurs on Tone 3 and
Tone 4, but creaky voice helps in identification of Tone 3 only, and not of
Tone 4.
In conclusion, in addition to F0 contours, other properties may contribute
to contrasting tones, including duration, intensity, and phonation types. Some
phonetic cues may play an important role besides F0 contours, as attested in
many tone languages.

6.1.4 Phonological representations of tones

Some phonological representations and models for tones are already avail-
able in the early studies. Chao (1930) proposes “tone letters”, dividing the
pitch range into four parts using values from one to five, where one stands
for the lowest and five the highest. It is a more convenient way to label low,
half-​low, medium, half-​high, and high pitch using integers. This system can
represent “straight tones”, such as 11 and 13, and “circumflex tones”, such
as 131 and 153 as well as “short tones”, such as 1 and 2. Chao’s letters are
accepted by the International Phonetic Association (1989), and are still
widely used in transcribing tones in Asian languages. Field workers tend to
simplify the measured values of tones so that Chao’s letters give a simple,
visually clear representation.
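The three shape classes mentioned above can be illustrated with a small sketch. It is only an illustration: the function name `classify_chao` and the rule that a three-digit value counts as circumflex when its pitch direction reverses are assumptions made here, not definitions taken from Chao (1930).

```python
def classify_chao(tone: str) -> str:
    """Classify a Chao tone value such as "13" or "131".

    Assumed rule (not from the source): one digit is a short tone, two
    digits a straight tone, and three digits a circumflex tone if the
    pitch direction reverses at the middle digit.
    """
    digits = [int(ch) for ch in tone]
    if not 1 <= len(digits) <= 3 or any(d < 1 or d > 5 for d in digits):
        raise ValueError(f"not a Chao tone value: {tone!r}")
    if len(digits) == 1:
        return "short"
    if len(digits) == 2:
        return "straight"
    first_move = digits[1] - digits[0]
    second_move = digits[2] - digits[1]
    return "circumflex" if first_move * second_move < 0 else "straight"


# Examples quoted in the text above.
for value in ["11", "13", "131", "153", "1", "2"]:
    print(value, classify_chao(value))
```

Under this assumed rule, the examples given in the text fall into the classes the text assigns to them.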
Goldsmith (1999) claims that autosegmental phonology provides a representation
of coordinated coarticulators. This theory was proposed to resolve five
main problems in classical theories: “contour-valued features”, such as a
contour tone consisting of two level tones; “stability”, such as the deletion of
a tone-bearing vowel not guaranteeing deletion of its tone; “melody levels”;
“floating tone”; and “bidirectional spreading”. Autosegmental phonology
represents tones on a separate level rather than use binary features within
one segment because [+high pitch] and [-​high pitch] within one segment are
Goldsmith (1976) proposes that autosegmental representation separates
tiers for tones and tone-​bearing units (TBUs). Clements and Ford (1979)
represent this by two tiers of˜ as the TBU, and T as tones with brackets
marking the domain.
Many other models have been proposed for representing tones. Chen (2000)
summarizes the predictions of Yip’s (1990), Duanmu’s (1990), and Bao’s (1990)
models, and concludes that Yip’s model cannot explain contour and register
spread, and that Duanmu’s model cannot explain whole contour tone shift,
while Bao’s model can explain both. Autosegmental representations have
been proposed for Mandarin Chinese (Chang 1992) and Dongshi Hakka
tones (Lin 2011). The six tones of 33, 11, 31, 53, 35, and 55 can be further
represented by registers and tonal segments.
Both the representations by Chao’s letters and autosegmental structures
can be further subject to analysis by optimality theory (OT) (McCarthy and
Prince 1993, 1994; Prince and Smolensky 1993; McCarthy 1996, 2002, 2007;
Chen 2000; Yip 2002). For example, Chen (2000: 165) constructs an OT
analysis of the Boshan dialect using Chao’s letters. Yip (2002) also offers an
overview of a constraint-based approach to autosegmental representations.
This study proposes a way to integrate acoustic data and phonology in
offering tonal representations. The approach to transforming phonetic data
into phonological representations is based on the idea of underlying pitch
targets (Xu and Wang 2001) illustrated below.

6.1.5 Underlying pitch targets and advantages of this study

The concept of underlying pitch targets was first proposed by Xu and Wang
(2001), who distinguished between the underlying pitch targets and surface
F0 contours. There are two branches of modeling, one capturing F0 contours
on the surface directly, and the other modeling the underlying pitch targets
instead. Xu and Wang’s (2001) model belongs to the second category, which
is simpler than other models such as the command-response (CR) model
by Fujisaki et al. (2005) (Prom-On et al. 2009). Xu and Wang’s (2001)
model is also argued to be superior to previous approaches such as the IPO
model (’t Hart, Collier and Cohen 1990), autosegmental-metrical theory, and
its ToBI extension (Pierrehumbert 1980; Silverman et al. 1992), in that those
models fail to consider appropriate articulatory mechanisms in the speech
production of prosody (Xu and Prom-on 2014).
The definition of pitch target is “the smallest articulatorially operable units
associated with linguistically functional pitch units such as tone and pitch
accent” (Xu and Wang 2001: 321). A mora or a syllable can serve as the host
on which an underlying pitch target is implemented. The underlying pitch targets
can be of two types, where one is a static pitch target with a [high] or [low]
register, and the other is a dynamic pitch target such as [rise] and [fall]. In
addition, three underlying mechanisms may influence the realization of pitch
targets: the maximum speed of pitch change, the maximum speed of changing
F0 direction, and coordination between targets and hosts.
This study proposes a way to transform from phonetic data to phono-
logical representations in Chao’s letters. In the literature, methods were
proposed for the transformation from phonetic data to Chao’s letters and
for improving Chao’s model (Shi 1990; Zhu et al. 2012; Rose 2014). Zhu
et al. (2012) include phonation types in their analysis of the tone system in
Yuliang Miao. Paterson (2015) and Rose (2014) consider the effectiveness
of Chao’s model and the challenges met in representing tones using his
model. Paterson (2015) mentions challenges from phonetic transcription
in tone systems with six level pitches, with contours establishing six levels
of pitch, and those with upstep and downstep. Rose (2014) uses percep-
tually transformed F0 values to investigate the accuracy of Chao’s model.
The evaluation is based on measurements related to the likelihoods of pitch
targets when Chao’s model is fitted to the data from tones of Wencheng, a
Chinese Wu dialect of Zhejiang province, and the results show that not all
tones conform well to the model.

This study argues that statistical modeling can improve the transformation
method, and the method proposed here has the following advantages: 1)
In providing a phonological representation, we need to determine whether to
represent a tone as a straight tone or a circumflex tone. This can be
statistically tested using the model selection procedure specified in Section
6.4. 2) We can only obtain a sample of speakers from the speech community and
a sample of the unlimited utterances of each individual speaker. Therefore, we
need powerful statistical tools in order to model and predict the tonal contours
of the whole speech community. Instead of using averaged F0 contours, it
is better to choose the optimal statistical model and obtain the fitted values
based on the chosen model before we transform the pitch values to Chao’s
letters. 3) This method is based on underlying pitch targets (Xu and Wang
2001; Prom-on et al. 2009; Chen et al. 2017), an approach that is argued to
conform to articulatory mechanisms (Xu and Prom-on 2014), instead of directly
modeling the surface contours using regression models (Andruski and Costello
2004). 4) This study uses log z-score normalization to take the perceptual
aspect into consideration (Nolan 2003; Fujisaki 2004 as cited in Prom-On
et al. 2009), though future studies are needed to compare it with a semitone
transformation (Rose 2014). 5) The transformation assigns a tone value to a
single fitted value for the adjusted onset, turning point, or offset by
comparing it with the 20 percent, 40 percent, 60 percent, and 80 percent
sample quantiles calculated from all the fitted values. 6) The number
of quantiles can be specified according to the target representation, such as
Chao’s five-level tone letters. The method is relatively flexible and
does not require the number of quantiles to be fixed, so it can adapt
to new tonal models proposed in the future that improve on Chao’s model
to better account for the challenges raised by Rose (2014) and Paterson
(2015). The details of the model-fitting procedure for this study are described
in Section 6.4.
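The normalization and quantile steps in points 4 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Section 6.4: the function names, the made-up F0 values, the use of the sample standard deviation, and the handling of values that fall exactly on a cut point are all assumptions. The base of the logarithm does not affect the z-scores.

```python
import math
import statistics
from bisect import bisect_right


def log_z_scores(f0_hz):
    """Log z-score: standardize log10(F0) by the mean and SD of the same sample."""
    logs = [math.log10(v) for v in f0_hz]
    mean, sd = statistics.mean(logs), statistics.stdev(logs)
    return [(x - mean) / sd for x in logs]


def to_tone_level(value, all_values, levels=5):
    """Return a level in 1..levels by comparing value with sample quantiles."""
    # For levels=5 the cut points are the 20, 40, 60, and 80 percent quantiles.
    cuts = statistics.quantiles(all_values, n=levels)
    return bisect_right(cuts, value) + 1


# Made-up F0 values in Hz, standing in for fitted values from a model.
fitted_hz = [100, 120, 140, 160, 180, 200, 220, 240, 260, 280]
normalized = log_z_scores(fitted_hz)
levels = [to_tone_level(v, normalized) for v in normalized]
print(levels)
```

Passing a different `levels` argument changes the number of quantile cut points, which mirrors the flexibility described in point 6.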
The structure of the paper is as follows. Section 6.2 describes the method-
ology, including subjects recruited and materials used in the phonetic examin-
ation, and presents the phonetic results. Section 6.3 introduces the perceptual
experimental design based on the findings in Section 6.2, and presents the
results with respect to accuracy rate and reaction time. Section 6.4 discusses
the transformation method, including the procedure of statistical modeling
and the assignment of tone values. Section 6.5 offers some general discussion,
including future studies, and draws a conclusion.

6.2 Phonetic examination

6.2.1 Subjects and materials

This study recruited subjects from the city of Qidong, and all subjects chosen
were around 45 years old with minimal contact with other languages and
dialects. Fifteen male and fifteen female speakers were recruited, and no
participant reported a history of speaking, hearing, or language difficulty.
They received financial compensation for participation. The University of
Florida Institutional Review Board approved the experimental procedures.
In this study, the participants were recorded reading 1,080 monosyllabic
tokens (12 monosyllables * 3 replications * 30 speakers). The vowel was
controlled in this study to avoid different vowel height intrinsically related
to pitch height (House and Fairbanks 1953; Hombert 1977). The syllable
CV was used with V = /æ/ and C = /t, tʰ, d/ throughout the study, and
Chinese characters were used to elicit data from participants, since each
character was unique and participants could recognize characters effectively.
Because Chongming Chinese shows extensive possible sandhi rules (Chen
and Zhang 1997; Zhang 2009), we recorded tones in their citation form to
avoid contours affected by sandhi. Recording was conducted
in a quiet room using a Marantz PMD 660 digital recorder and a Shure SM2
headset with a microphone, and all the participants were instructed about
the recording procedure in Chongming Chinese by a native-speaker helper.
The recordings, sampled at 48 kHz, were transferred to a PC using
a USB cable.

6.2.2 Segmentation and phonetic examination

The vowels were segmented manually according to the criteria specified
below, and then F0 values were extracted by a Praat script written by
Byunggon Yang and edited by Jirapat Jangjamras (Boersma and Weenink
2013). F0 values were extracted at 20 points of the target vowel with an
analysis window size of 25.6 ms, and they were saved for further analysis. The
segmentation criteria described in Jangjamras (2012) and Zhang et al. (2008)
were followed, where wide-band spectrograms, waveform displays, and audio
cues were all considered in the segmentation. The extracted F0 values were
then subjected to statistical modeling for transformation into phonological
representations, as described in Section 6.4. Other phonetic properties, such
as duration and glottalization for each tone, were then evaluated, and the
results are reported in Section 6.2.3.
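Sampling a vowel at a fixed number of points makes tokens of different durations comparable. A minimal sketch of where 20 measurement points might fall is given below. It assumes the points are equally spaced and include both vowel boundaries, which the text does not state; the function name and the boundary times are hypothetical, and the actual extraction was done by the Praat script cited above.

```python
def sample_times(onset_s, offset_s, n_points=20):
    """Return n_points equally spaced times from vowel onset to offset (inclusive)."""
    if n_points < 2 or offset_s <= onset_s:
        raise ValueError("need at least two points and a positive duration")
    step = (offset_s - onset_s) / (n_points - 1)
    return [onset_s + i * step for i in range(n_points)]


# Hypothetical vowel boundaries in seconds.
times = sample_times(0.120, 0.369)
print(len(times), times[0], round(times[-1], 3))
```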

6.2.3 Results of production data on duration and glottalization

A close phonetic examination of each speaker shows that glottalization does
not always accompany Tone 7 and Tone 8. Table 6.2 shows the proportion of
glottalization in monosyllabic tones by 30 speakers, and Figure 6.1 offers
a visualization of the proportions. Table 6.2 and Figure 6.1 also show that
the proportion varies extensively among speakers. On average, 67 percent of
Tone 7 and 77 percent of Tone 8 tokens are accompanied by glottalization.

Figure 6.1 Proportion of glottalization on Tones 7 and 8 by 30 speakers (vertical axis: proportion of glottalization; horizontal axis: 0–30; series: Tone 7, Tone 8)

In addition to glottalization, it is also mentioned in the field records that
Tones 7 and 8 are of shorter duration (Zhang 2009). Phonetic examination of
the duration was also conducted to examine this claim. A summary of the mean
duration and a visualization of it for each tone are presented in Table 6.3 and
Figure 6.2.

Table 6.2 Proportion of glottalization accompanying Tone 7 and Tone 8

Speaker   Tone 7   Tone 8     Speaker   Tone 7   Tone 8
1         0.93     1.00       16        0.87     0.89
2         0.33     0.33       17        0.67     0.89
3         0.87     1.00       18        0.60     1.00
4         0.80     0.78       19        0.73     0.44
5         0.87     0.89       20        0.20     0.67
6         0.67     0.78       21        0.93     0.89
7         0.47     0.78       22        0.87     0.89
8         0.67     0.89       23        0.87     1.00
9         0.27     0.33       24        0.40     0.67
10        0.27     0.22       25        0.47     1.00
11        0.60     0.56       26        0.40     1.00
12        0.53     0.33       27        0.93     1.00
13        0.87     0.67       28        0.87     1.00
14        0.93     0.89       29        0.80     0.89
15        0.87     0.89       30        0.60     0.44
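The averages of 67 percent and 77 percent can be recovered from Table 6.2. The sketch below assumes the reported averages are unweighted means of the per-speaker proportions; the variable names are introduced here for illustration.

```python
from statistics import mean

# Per-speaker proportions copied from Table 6.2 (speakers 1-30, in order).
tone7 = [0.93, 0.33, 0.87, 0.80, 0.87, 0.67, 0.47, 0.67, 0.27, 0.27,
         0.60, 0.53, 0.87, 0.93, 0.87, 0.87, 0.67, 0.60, 0.73, 0.20,
         0.93, 0.87, 0.87, 0.40, 0.47, 0.40, 0.93, 0.87, 0.80, 0.60]
tone8 = [1.00, 0.33, 1.00, 0.78, 0.89, 0.78, 0.78, 0.89, 0.33, 0.22,
         0.56, 0.33, 0.67, 0.89, 0.89, 0.89, 0.89, 1.00, 0.44, 0.67,
         0.89, 0.89, 1.00, 0.67, 1.00, 1.00, 1.00, 1.00, 0.89, 0.44]

# Print the mean and the range across speakers for each tone.
for name, values in (("Tone 7", tone7), ("Tone 8", tone8)):
    print(name, round(mean(values), 2), min(values), max(values))
```

The per-speaker range (0.20 to 0.93 for Tone 7 and 0.22 to 1.00 for Tone 8) shows the between-speaker variation described in the text.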

The ANOVA results for the duration of Tones 1–7 show that these tones
differ significantly in duration (F(6, 623) = 26.78, p < 0.001).
A similar analysis was made to test whether Tone 8 is shorter than other
tones. The ANOVA results for Tones 1–6 and Tone 8 also show
that these tones differ significantly in duration (F(6, 623) = 35.29,
p < 0.001).
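The reported degrees of freedom can be read back into a group count and a sample size. The sketch below assumes a one-way ANOVA, for which the between-group degrees of freedom are k - 1 and the within-group degrees of freedom are N - k; the function names are introduced here for illustration.

```python
def one_way_anova_df(n_groups, n_observations):
    """Degrees of freedom (between, within) for a one-way ANOVA."""
    return n_groups - 1, n_observations - n_groups


def design_from_df(df_between, df_within):
    """Invert the formulas: recover (number of groups, number of observations)."""
    n_groups = df_between + 1
    return n_groups, df_within + n_groups


print(design_from_df(6, 623))  # (7, 630)
```

Under this assumption, F(6, 623) corresponds to seven tones and 630 duration measurements in each analysis.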
For post-hoc analysis, I used Dunnett’s test, treating Tone 7 (T7) and Tone
8 (T8) in turn as the control group in testing the differences in duration from
other tones. From Tables 6.4 and 6.5 as well as Figures 6.3 and 6.4, it can be
inferred that the durations of Tone 7 and Tone 8 are significantly different
from those of the other tones except Tone 1. Thus, we can conclude that Tones
7 and 8 are shorter than most other tones.

6.3 Perceptual study

6.3.1 Introduction to the perceptual experiment

As mentioned in Section 6.1, Tones 1, 3, 5, and 7 only occur after a voiceless
aspirated or unaspirated obstruent, whereas Tones 2, 4, 6, and 8 only occur
after a voiced obstruent (Zhang 2009). Therefore, they can be deemed as
allotone pairs with respect to voicing in the onset. Section 6.2 also discusses
differences in duration among these tones, and a proportion of glottalization
accompanying Tone 7 and Tone 8 in the collected speech data and fieldwork

Figure 6.2 Duration values of eight tones (one category per tone, Tone 1 to Tone 8)

Figure 6.3 The plot for the result of the Dunnett’s test (control group: Tone 7) (95% family-wise confidence level; horizontal axis: linear function, 0–150)

Table 6.3 Mean duration of each tone

Tone      Mean duration
Tone 1    249.01
Tone 2    328.84
Tone 3    380.63
Tone 4    294.20
Tone 5    370.80
Tone 6    370.60
Tone 7    248.43
Tone 8    228.06

Table 6.4 The results of the Dunnett’s test (Tone 7 as a control group)

Pair     Estimate   Lower Bound   Upper Bound   Pr (>|t|)
T1-T7      0.58       –39.75         40.90       1.00
T2-T7     80.41        40.08        120.74      <0.001 *
T3-T7    132.20        91.87        172.53      <0.001 *
T4-T7     45.77         5.44         86.10       0.019 *
T5-T7    122.37        82.04        162.70      <0.001 *
T6-T7    122.17        81.84        162.50      <0.001 *

Table 6.5 The results of the Dunnett’s test (Tone 8 as a control group)

Pair     Estimate   Lower Bound   Upper Bound   Pr (>|t|)
T1-T8     20.96       –17.00         58.91       0.52
T2-T8    100.79        62.83        138.74      <0.001 *
T3-T8    152.58       114.63        190.53      <0.001 *
T4-T8     66.14        28.19        104.10      <0.001 *
T5-T8    142.74       104.79        180.70      <0.001 *
T6-T8    142.54       104.59        180.50      <0.001 *
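The estimates in Tables 6.4 and 6.5 can be checked against the mean durations in Table 6.3, since each estimate is the mean of a tone minus the mean of the control tone. The sketch below only re-reads the published numbers and does not rerun Dunnett’s test; the dictionary layout and the function name are assumptions introduced for illustration.

```python
# Mean durations copied from Table 6.3.
mean_duration = {1: 249.01, 2: 328.84, 3: 380.63, 4: 294.20,
                 5: 370.80, 6: 370.60, 7: 248.43, 8: 228.06}

# (estimate, lower bound, upper bound) copied from Tables 6.4 and 6.5,
# keyed first by control tone and then by comparison tone.
dunnett = {
    7: {1: (0.58, -39.75, 40.90), 2: (80.41, 40.08, 120.74),
        3: (132.20, 91.87, 172.53), 4: (45.77, 5.44, 86.10),
        5: (122.37, 82.04, 162.70), 6: (122.17, 81.84, 162.50)},
    8: {1: (20.96, -17.00, 58.91), 2: (100.79, 62.83, 138.74),
        3: (152.58, 114.63, 190.53), 4: (66.14, 28.19, 104.10),
        5: (142.74, 104.79, 180.70), 6: (142.54, 104.59, 180.50)},
}


def check(control):
    """Return (tone, recomputed difference, reported estimate, interval excludes 0)."""
    rows = []
    for tone, (estimate, lower, upper) in dunnett[control].items():
        difference = mean_duration[tone] - mean_duration[control]
        excludes_zero = lower > 0 or upper < 0
        rows.append((tone, round(difference, 2), estimate, excludes_zero))
    return rows


for control in (7, 8):
    for row in check(control):
        print(f"control T{control}:", row)
```

The recomputed differences agree with the reported estimates up to rounding in the last digit, and only the Tone 1 intervals include zero, which matches the pattern of starred rows in both tables.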

Before proposing tonal representations based on statistical modeling of
phonetic data, this study conducts a perceptual study to examine whether
F0 contours are perceived to be different when the onset is not present, and
whether other cues such as onset consonants, duration, and glottalization
contribute to discrimination of allotone pairs after voiced versus voiceless
onsets when F0 contours are present.

Figure 6.4 The plot for the result of the Dunnett’s test (control group: Tone 8) (95% family-wise confidence level; horizontal axis: linear function, 0–150)

Some common methods in speech perception research include discrimination
tasks such as AX tasks, as well as identification tasks such as yes-no
and forced-choice identification tasks (McGuire 2010). In the discrimination
AX (same-different) task, the participant determines whether two stimuli
in a trial are the same or different (Gerrits and Schouten 2004). Usually,
the number of same trials should match the number of different trials. The
advantages are that the design is simple, so that the differences and similarities
do not need to be described to the participants, and they do not need to know
a particular label (McGuire 2010). Also, the reaction time is reliable and easy
to measure because participants make decisions based on the second stimulus.
A disadvantage is that when the task is difficult, subjects tend to respond
“same” more often (McGuire 2010).
Identification tasks usually require the participants to give specific labels
for presented sounds. One type of identification task is the yes-no task,
where subjects are asked whether a certain stimulus is present or
whether it was x or y. It has the advantage of simplicity, but it has the
disadvantage that there are no “direct comparisons of stimuli in each trial”.
In forced-choice identification tasks, the subject has to provide a label for
the stimulus presented. For example, the subject can be asked to label which
allotone they have heard in a tonal language. This design is also simple,
but when the response set is large, the analyses become difficult. Also, the
forced-choice identification task forces subjects to make categorical decisions
(McGuire 2010).
Since my goal in the perceptual experiment is to test whether there are
other contributing cues in addition to F0 contours, the AX task, which is
simple to implement, suffices. Reaction time is also used as one of
the response variables, and the AX task can provide a reliable measurement
of it, as mentioned above. Moreover, some direct comparisons are needed in
each trial, which the identification tasks cannot provide. Subjects may also be
confused about the labeling of allotones, for which they have had no formal
training. An identification task would therefore require additional training on
labeling, which increases the likelihood of error. The next section discusses
some concerns about ISI conditions before the experiment setup is described.


6.3.2 Concerns about ISI conditions

Fujisaki and Kawashima (1969) propose a dual-​factor model, where under
some conditions stimuli are processed according to phonetic categories,
while under other conditions subjects can further discriminate the stimuli
on a finer scale. Werker and Logan (1985) propose three factors in speech
perception: auditory, phonetic, and phonemic processing. They define phon-
emic processing as perception based on phonological categories of the native
language. Phonetic processing is proposed to account for perception not
according to phonological categories of the native language but phonetic
distinctions in any language. Auditory processing can explain perception that
has no phonetic boundaries.
Werker and Logan (1985) used stimuli of consonant contrasts in Hindi
to test native English speakers. They made three types of pairs: “physically
identical (PI) pairings”, “name-identical (NI) pairings”, and “different (DIF)
pairings”. The PI pairings are identical, and the NI pairings are non-identical
within Hindi phonetic categories. The PI and NI pairings are referred to
as within-category pairings, while the DIF pairings are referred to as
between-category pairings. These pairings were presented under different
ISI conditions: 250 ms, 500 ms, and 1500 ms. Their results showed phonemic
processing at 1500 ms, phonetic processing at 250 ms and 500 ms, and
auditory processing at 250 ms.
Burnham and Francis (1997) used five Thai tones on the same syllable [ba:]
to create trials for a discrimination task. The same phone pairs (AA or BB)
and different pairs (AB and BA) were presented to native Thai and native
English speakers with different ISI conditions (500 and 1500 ms). Their results
showed an interaction between ISI conditions and language backgrounds.
Native Thai speakers discriminated tones better at 1500 ms, and English
speakers did better at 500 ms. They argued that Thai speakers processed Thai
tones at high levels, namely in the phonological mode, while English speakers
processed those tones in the phonetic mode.
However, the results from Wayland and Guion (2003) were slightly
different. They ran a perception test of two Thai tones on native English
speakers with and without experience with Thai, and also used native Thai
speakers as a control group. Two interstimulus interval (ISI) conditions
of 500 ms and 1500 ms were used in order to test the effects of ISI.
The results only showed an ISI effect for the two most difficult tone contrasts
but not for the overall data. The simple effect of ISI was only significant for
experienced English learners of Thai.
In the present study, I used an ISI of 1500 ms since most scholars agree
that 1500 ms is the time needed for phonological processing. It is also possible
to treat the ISI condition as a separate variable in the design, with two
levels of 500 ms and 1500 ms. However, adding this variable would make the
perceptual study too long to yield reliable results. Hence, I only used the
1500 ms ISI as a default choice for phonemic processing. Details about the
stimuli and participants are given in the next section.

6.3.3 Materials and participants

Considering duration, onset consonants, and glottalization, 16 sets of stimuli
were created, as shown in Table 6.7. Each set manipulates the eight allotones
in the onset consonants, duration, or glottalization. In order to test the main
effects of onset consonants, duration, and the phonation types, I used a
fractional factorial design. This design is efficient for investigating more
than one treatment factor, since it allows one treatment factor to be
interpreted while the other treatment factors are taken into account.
An introduction to fractional factorial design can be found in Kuehl (2000).
An illustration of those main effects is provided below.
The allotones were manipulated based on the natural utterances of a female
native speaker of Chongming Chinese. The stimuli were excised from the
original CV syllables. The averaged values of pitch were superimposed on the
same target syllable /æ/. F0 manipulations were accomplished manually, and
the modification of duration was done using a Praat script by Chris Darwin,
employing the “Pitch-Synchronous Overlap and Add” (PSOLA) function in Praat.
The algorithm divides the original signal into discrete and overlapping
analysis segments, modifies each analysis segment into a synthesis segment
(by repetition or omission for lower or higher F0), and recombines the
segments by overlap-adding (Charpentier et al. 1986; Valbret et al. 1991; as
cited in Lemmetty 1999).

Table 6.6 Pairs of allotones to be compared

Pairs
T1 T2    T2 T3    T3 T4    T4 T5    T5 T6    T6 T7    T7 T8
T1 T4    T2 T5    T3 T6    T4 T7    T5 T8
T1 T6    T2 T7    T3 T8
T1 T8

Table 6.7 An example of the fractional factorial design with three variables

Trial type   Allotone    Duration   Glottalization   Onset
Same         1a1a        +          -                -
Same         1b1b        -          +                -
Same         1c1c        -          -                +
Same         1abc1abc    +          +                +
Different    1a2a        +          -                -
Different    1b2b        -          +                -
Different    1c2c        -          -                +
Different    1abc2abc    +          +                +

The portion starting with the first glottal cycle without an onset con-
sonant was chosen for the test, in order to exclude information from the onset
consonants and the transition part. Also, duration and glottalization were
considered as variables. The pairs of allotones to be compared are listed in
Table 6.6, which include all the pairs following voiceless aspirated and voiced
onset consonants. So all the pairs with a difference in the reported voicing of
onset consonants are included.
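
As a sketch of how the pairs in Table 6.6 can be enumerated: every pair in
the table joins an odd-numbered tone with an even-numbered tone. This parity
rule is read off the table rather than stated in the text, so it is an
assumption of the example. The trial labels follow the pattern used in
Tables 6.12 to 6.15 (for example “T2aT7a”), and the combination letters are
those of Table 6.7.

```python
from itertools import product

TONES = range(1, 9)
COMBINATIONS = ["a", "b", "c", "abc"]  # sign combinations from Table 6.7

# Assumption read off Table 6.6: a pair always joins an odd-numbered tone
# with an even-numbered tone.
pairs = sorted(
    (min(odd, even), max(odd, even))
    for odd, even in product(TONES, TONES)
    if odd % 2 == 1 and even % 2 == 0
)

# Each "different" trial type is one pair presented under one combination,
# e.g. "T1aT2a" for the pair (1, 2) under combination "a".
different_trials = [f"T{i}{c}T{j}{c}" for (i, j) in pairs for c in COMBINATIONS]

print(len(pairs), len(different_trials))
```

The enumeration yields 16 pairs and, with four combinations per pair, 64
different-trial types.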
Table 6.7 displays the factor levels of each factor: onset consonants, dur-
ation, and phonation types within each set. For the first column, “Allotone”,
the pairs of allotones are listed as well as letters “a”, “b”, “c”, and “abc”,
which stand for different trials, where each trial has different combinations of
the signs of “+” and “-​”. For the columns “Duration”, “Glottalization”, and
“Onset Consonant”, there are two levels within each factor: “-​” (No) and “+”
(Yes). The onset consonant was either removed (indicated by “-​(No)”) or it
was retained (indicated by “+ (Yes)”). If the mean values of the duration from
the real data are not used (indicated by “-​(No)”), then the normalized values
of the duration are used. Normalized duration values mean that the stimuli
were created so that there are no differences in duration across allotones. No
information about glottalization means that the glottalized portion of the
stimuli was removed, and the mean values of the remaining portion of the
contour were calculated. The intensity values are scaled for all the stimuli.
In sum, statistically speaking, the goal is to test whether three variables
have significant main effects in perceptual experiments:

A Duration: normalized duration (-); non-normalized duration (+)
B Glottalization: without glottalization (-); with glottalization (+)
C Onset Consonant: without a consonant (-); with a consonant (+)
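
The four sign combinations a, b, c, and abc in Table 6.7 coincide with the
half fraction of a two-level, three-factor design in which the product of
the three signs is positive (defining relation I = ABC). The text does not
name the defining relation, so this reading is an assumption; the sketch
below only shows how such a half fraction can be generated.

```python
from itertools import product

FACTORS = ("Duration", "Glottalization", "Onset Consonant")  # A, B, C

def half_fraction():
    """Return the runs of a 2^(3-1) design with defining relation I = ABC."""
    runs = []
    for signs in product((+1, -1), repeat=3):
        a, b, c = signs
        if a * b * c == +1:          # keep the half where ABC = +1
            # Label a run by the lower-case letters of its "+" factors.
            label = "".join(l for l, s in zip("abc", signs) if s == +1)
            runs.append((label, signs))
    return sorted(runs, key=lambda run: (len(run[0]), run[0]))

design = half_fraction()
for label, signs in design:
    print(label, ["+" if s > 0 else "-" for s in signs])
```

In a half fraction of this kind, each main effect is aliased with the
interaction of the other two factors, which is why only main effects are
tested.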

An illustration of the signs “+” and “-​” for each variable is listed as follows.
Duration (A):
“+” means that the specific mean duration of a certain allotone calculated
from the production data was used.
“-​” means that the overall grand mean of all allotones was used as a
normalized duration value.

Glottalization (B):
“+” means that a glottal stop appeared at the end of the vowel.
“-​” means there was no glottal stop or irregular pulses.

Onset Consonant (C):
“+” means the onset consonant of the allotone was present.
“-​” means the onset consonant was truncated, and the vowel starts at
the first zero-​crossing point in the first cycle. The amplitude con-
tour was adjusted to be a gradual slope from intensity value zero to
the original value at the 25th millisecond to avoid abruptness of the

A total of 128 trials were created, with equal numbers of same and
different trials. The different trials consist of 16 pairs and four types of
“+” and “-” combinations, as listed in Table 6.7. In this perceptual study,
16 listeners were recruited, including eight females and eight males, who
were around 45 years old. The participants had lived in Qidong city for
most of their lives, with minimal contact with other languages and dialects.
No participants reported a history of speaking, hearing, or language
difficulty, and they received financial compensation for their participation.
The University of Florida Institutional Review Board approved the experimental
procedures. A Shure SM2 headset was used for the perceptual experiment.
The modified stimuli were presented to the participants using the software
E-Prime, and a training session with feedback on whether each response was
correct preceded the real trials. In the training session,
participants familiarized themselves with the allotones and the
corresponding plotted graphs, and learned to press the button standing for
“same” when they believed the same allotones occurred in one trial, and
“different” if they believed different allotones were presented.
Two kinds of response values were collected from the participants. The
first response is the A’ value, which is calculated from correct and incorrect
responses to the stimuli. Same and different trials each make up 50 percent
of the data to achieve a balance. The A’ value is set to 0.5 when the hit rate
(H) is the same as the false alarm rate (FA). It is calculated for the other
conditions as follows (Snodgrass et al. 1985: 451):

1. If H > FA: $A' = 0.5 + \dfrac{(H - FA)(1 + H - FA)}{4H(1 - FA)}$
2. If H < FA: $A' = 0.5 - \dfrac{(FA - H)(1 + FA - H)}{4FA(1 - H)}$
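
The two cases above, together with the value 0.5 for H = FA, can be written
as a single function. This is a minimal sketch; the function name and the
example rates are hypothetical and not taken from the experiment.

```python
def a_prime(hit_rate, false_alarm_rate):
    """Non-parametric sensitivity A' from a hit rate H and false alarm rate FA."""
    h, fa = hit_rate, false_alarm_rate
    if h == fa:
        return 0.5
    if h > fa:
        return 0.5 + (h - fa) * (1 + h - fa) / (4 * h * (1 - fa))
    return 0.5 - (fa - h) * (1 + fa - h) / (4 * fa * (1 - h))

# Hypothetical rates, not taken from the experiment.
example = a_prime(0.9, 0.2)
print(round(example, 4))
```

The two branches mirror each other: swapping H and FA turns a value x into
1 - x, so chance performance sits at 0.5.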

The A’ values are calculated for each set of a, b, c, and abc, where each
set has different “+” and “-​” values for the three variables to be tested. The
A’ values for all the sets a, b, c, and abc are averaged in order to test whether
there are main effects of duration, glottalization, and onset consonants.
A similar procedure is conducted on the second response, namely reaction
time (RT). Average reaction time is also used to test main effects of duration,
glottalization, and onset consonants. In order to examine whether these cues
contribute to certain allotone pairs, t-​tests are conducted, as illustrated in the
next section.


6.3.4 Results for the perceptual experiment

For monosyllabic data, the mean of reaction time in each group is shown in
Table 6.8, where the letters a, b, c, and abc represent different combinations
of “+” and “-​” for the three variables duration, glottalization, and onset
consonants. The reaction time and accuracy rate for each pair with the com-
bination “a, b, c, and abc” tested are listed in the Appendix.
The Pareto chart and the half-normal plot of these effects are shown in
Figures 6.5 and 6.6. In the Pareto chart, none of the effects (A, B, and C)
crosses the reference line that marks significance. Similarly, in the
half-normal plot, the effects A (duration), B (glottalization), and C (onset
consonant) do not deviate much from the straight line, suggesting that they
may not be significant. However, the graphs suggest that factors A and B have
a bigger effect than C; therefore, factors A and B are included in the model
in order to obtain the ANOVA results and test the significance numerically.
The results of the ANOVA are shown in Table 6.9. From the ANOVA table, we can
conclude that the main effects on reaction time are not significant, since
the p values exceed 0.05. A similar procedure is followed in examining the
accuracy rate. A high accuracy rate is achieved across tonal pairs
and combinations. The mean A’ value for each group is shown in
Table 6.10.
The Pareto chart and half-​normal plot of the main effects are shown in
Figures 6.7 and 6.8. Similarly, no effects cross the straight line in the Pareto
chart, and no effects deviate significantly from the straight line in the half-​
normal plot. We can include factor B, “Glottalization”, as a potentially sig-
nificant factor since it is closest to the straight line in the Pareto chart, and
deviates the most from the line in the half-​normal plot. The results of the
ANOVA tests are shown in Table 6.11.
From Table 6.11, we can conclude that there are no significant main effects
of the proposed three factors: duration, glottalization, and onset consonants,
when we consider perception of the allotone pairs on the whole. From the
results of the A’ value and reaction time, the three effects tested do not seem
to contribute significantly in discriminating the pairs in addition to the F0

Table 6.8 Reaction time for each group

Allotone/​Factor Duration Glottalization Onset Consonant Reaction Time

a + -​ -​ 1281.27
b -​ + -​ 1303.75
c -​ -​ + 1295.31
abc + + + 1294.93

Figure 6.5 The Pareto chart of the main effects (reaction time; Alpha = .05;
factors: A Duration, B Glottalization, C Onset Consonant; Lenth’s PSE = 33.144)

Figure 6.6 The half-normal plot of the effects (reaction time; Alpha = .05;
x-axis: Absolute Effect; factors: A Duration, B Glottalization, C Onset
Consonant; Lenth’s PSE = 33.144)


Table 6.9 ANOVA table of reaction time for each group

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F       Pr > F
Total                 3                    259.40           86.47
Duration              1                    130.53           130.53        19.17   0.14
Glottalization        1                    122.058          122.058       17.92   0.15
Error                 1                    6.81             6.81
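
The sums of squares in Table 6.9 can be approximately reproduced from the
four group means in Table 6.8. The sketch below is an illustration rather
than the original analysis: it assumes that, with Duration and Glottalization
in the model, the remaining contrast (Onset Consonant) supplies the single
error degree of freedom, which matches the Error row of the table.

```python
# Group means of reaction time from Table 6.8, with factor signs (A, B, C).
RUNS = {
    "a":   ((+1, -1, -1), 1281.27),
    "b":   ((-1, +1, -1), 1303.75),
    "c":   ((-1, -1, +1), 1295.31),
    "abc": ((+1, +1, +1), 1294.93),
}

def main_effect_ss(factor_index):
    """Return (effect, sum of squares) for one factor of the 4-run design."""
    contrast = sum(signs[factor_index] * y for signs, y in RUNS.values())
    n_runs = len(RUNS)
    effect = contrast / (n_runs / 2)        # mean at "+" minus mean at "-"
    sum_of_squares = contrast ** 2 / n_runs
    return effect, sum_of_squares

effects = {name: main_effect_ss(i) for i, name in enumerate("ABC")}
ss_a, ss_b, ss_c = (effects[name][1] for name in "ABC")

# Assumption: C's sum of squares serves as the error term (1 df).
f_a = ss_a / ss_c
f_b = ss_b / ss_c
print(round(ss_a, 2), round(ss_b, 2), round(ss_c, 2), round(f_a, 2), round(f_b, 2))
```

The results agree with Table 6.9 to within rounding of the group means.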

Table 6.10 A’ value for each group

Allotone Duration Glottalization Onset Consonant A’ value

a + -​ -​ 26.72
b -​ + -​ 25.91
c -​ -​ + 26.53
abc + + + 26.41

Figure 6.7 The Pareto chart of the main effects (A’ value; Alpha = .05;
factors: A Duration, B Glottalization, C Onset Consonant;
Lenth’s PSE = 0.515625)

Figure 6.8 The half-normal plot of the effects (A’ value; Alpha = .05;
x-axis: Absolute Effect; factors: A Duration, B Glottalization, C Onset
Consonant; Lenth’s PSE = 0.515625)

Table 6.11 ANOVA table of A’ value for each group

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   F      Pr > F
Total                 3                    0.36             0.12
Glottalization        1                    0.12             0.12          0.97   p = 0.43
Error                 2                    0.24             0.12

Since a glottal stop is reported to follow Allotone 7 and Allotone 8, and
they have shorter duration (Chen and Zhang 1997), I tested whether the
factors duration and glottalization contribute significantly in discriminating
these allotones specifically. For glottalization, the results of the two groups
with and without glottalization are included in Tables 6.12 and 6.13, where
the accuracy rate is transformed using arcsine transformation. Tables 6.14
and 6.15 present the results of two groups with and without normalized
duration. Tables 6.12 to 6.15 show no significant results obtained with
respect to accuracy rate and reaction time for the specific allotones tested.
Glottalization and duration do not seem to contribute significantly in
perceptually discriminating Allotone 7 and Allotone 8 from other tones tested,
though in production, glottalization and short duration are reported to
accompany these allotones.

Table 6.12 A t-test for accuracy rate and reaction time concerning glottalization
(Allotone 7)

Groups                 Accuracy (after arcsine transformation)   Reaction time
                       Mean   SD     t(df), p                     Mean      SD       t(df), p
With glottalization    1.36   0.09   t(190) = 0.20, p = 0.84      1505.27   908.15   t(169) = 0.19, p = 0.85
  (T2aT7a, T4aT7a, T6aT7a, T2cT7c, T4cT7c, T6cT7c)
Without                1.35   0.09                                1483.98   626.06

In sum, the perception experiment uses a fractional factorial design to
manipulate allotone pairs in order to test whether three factors may con-
tribute to discrimination of allotone pairs when F0 contours are present. The
factors tested are onset consonants, duration, and glottalization. The results
from the Pareto chart, half-​normal plot, and the ANOVA test show that these
factors do not contribute significantly in the discrimination task. Although
short duration and glottalization are reported for Allotones 7 and 8 in pro-
duction, they do not seem to contribute significantly in discriminating these
allotones when F0 contours are present.
For future studies, more experiments need to be designed that test the
identification of allotones, in addition to the discrimination task used in
this study, in order to examine whether these cues contribute significantly to
the identification rate and the reaction time associated with the
identification of allotones. More allotone pairs that differ in the
aspiration of onsets may also be tested, in order to compare the weighting of
cues in discrimination and identification tasks of allotone pairs with a voicing
difference in the onsets.
In this perceptual study, F0 contours seem to be enough to discriminate
allotone pairs with respect to voicing in the onset. The next step is to statis-
tically model the speech data, and to provide a tonal representation in Chao’s
letters, as discussed in Section 6.4.

Table 6.13 A t-test for accuracy rate and reaction time concerning glottalization
(Allotone 8)

Groups                 Accuracy (after arcsine transformation)   Reaction time
                       Mean   SD      t(df), p                    Mean      SD       t(df), p
With glottalization    1.36   0.092   t(190) = -0.57, p = 0.57    1479.87   645.91   t(167) = -0.73, p = 0.47
  (T1aT8a, T3aT8a, T5aT8a, T1cT8c, T3cT8c, T5cT8c)
Without                1.37   0.099                               1565.87   954.54
  (T1bT8b, T3bT8b, T1abcT8abc,

Table 6.14 A t-test for accuracy rate and reaction time concerning duration (Allotone 7)

Groups                     Accuracy (after arcsine transformation)   Reaction time
                           Mean   SD      t(df), p                    Mean      SD       t(df), p
With normalized duration   1.37   0.099   t(184) = 1.49, p = 0.14     1556.30   842.39   t(184) = 1.10, p = 0.27
  (T2bT7b, T4bT7b, T2cT7c, T4cT7c,
Without normalized         1.35   0.082                               1432.95   706.83
  (T2aT7a, T4aT7a,

6.4 Phonological tonal representations based on statistical modeling

6.4.1 Underlying pitch targets

As mentioned in Section 6.1, we provide phonological tonal representations
based on statistical modeling, and the idea of underlying pitch targets is used
in modeling. Xu (2005) proposes that the underlying target might be
curvilinear in some languages. The present study employs the modeling procedure
proposed by Chen et al. (2017), in which the degree of the underlying target
is determined by statistical testing. The model selection procedure is based on
models proposed to quantify the underlying pitch target.

Table 6.15 A t-test for accuracy rate and reaction time concerning duration (Allotone 8)

Groups                     Accuracy (after arcsine transformation)   Reaction time
                           Mean   SD      t(df), p                    Mean      SD       t(df), p
With normalized duration   1.36   0.097   t(190) = 0.19, p = 0.85     1603.97   965.85   t(162) = 1.38, p = 0.17
  (T1bT8b, T3bT8b, T1cT8c, T3cT8c,
Without normalized         1.36   0.095                               1441.76   621.23
  (T1aT8a, T3aT8a, T1abcT8abc,

Sun (2001) uses non-​linear regression methods to estimate the underlying
pitch target. The model he uses is as follows:

$$T(t) = at + b$$

$$y(t) = \beta e^{-\lambda t} + at + b$$

where T(•) represents the underlying target, and y(•) represents F0 values
on the surface. When t = 0, the coefficient β is the distance between F0 con-
tour and the underlying pitch target. The parameter λ represents the rate
of approaching the target. Wong (2006) uses a similar model to predict the
underlying pitch targets for Cantonese tones.
Prom-On et al. (2009) propose a third-order critically damped system,
which constrains the variable control parameters. The model has the form

$$x(t) = mt + b$$

$$f_0(t) = (c_1 + c_2 t + c_3 t^2) e^{-\lambda t} + x(t)$$

where f0(t) is the response of frequency, the underlying pitch target is x(t),
and λ represents the rate of approaching the target. The three parameters
c1, c2, and c3 are determined by the initial F0 values, initial velocity, and
initial acceleration.
The procedure for selecting the models and testing for the polynomial
degree follows Chen et al. (2017). First, four models were fit using non-linear
regression. The four models are the following:

1 a simple model (sim_1) with polynomial degree = 1

$$y(t) = \beta e^{-\lambda t} + at + b$$

2 a more complex model (com_1) with polynomial degree = 1

$$y(t) = (c_1 + c_2 t + c_3 t^2) e^{-\lambda t} + at + b$$

3 a simple model (sim_2) with polynomial degree = 2

$$y(t) = \beta e^{-\lambda t} + dt^2 + at + b$$

4 a more complex model (com_2) with polynomial degree = 2

$$y(t) = (c_1 + c_2 t + c_3 t^2) e^{-\lambda t} + dt^2 + at + b$$

In order to find plausible initial values, the following steps are taken:

1 I plot the function in “sim_1” and adjust the parameters until the
shape is similar to the curve connecting the mean values of y; the adjusted
parameters serve as initial values.
2 To fit “com_1”, I first use the estimated values for λ, a, and b obtained by
fitting “sim_1”. Then I use the estimated value of β obtained by fitting
“sim_1”, and add a white noise term to regress β onto t and t² to obtain
initial estimates for c1, c2, and c3.
3 To fit “sim_2”, I first use the λ value estimated by fitting “sim_1”,
create a covariate called et that equals exp(-λt), and regress y onto et, t,
and t² to obtain the values for β, d, a, and b.
4 To fit “com_2”, I first use the estimated values for λ, d, a, and b obtained
by fitting “sim_2”. Then I use the estimated value of β obtained by fitting
“sim_2”, and add a white noise term to regress β onto t and t² to obtain
initial estimates for c1, c2, and c3.
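
Step 3 above is an ordinary linear regression once λ is held fixed, because
the model “sim_2” is linear in β, d, a, and b. The following is a minimal
sketch of that step only, written in plain Python with the normal equations;
the function names are hypothetical, the subsequent non-linear fit is not
shown, and the check uses noise-free values generated from the Tone 5
coefficients in Table 6.16 rather than the original speech data.

```python
import math

def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def initial_values_sim2(times, y, lam):
    """Regress y on exp(-lam*t), t^2, t and 1; return (beta, d, a, b)."""
    design = [[math.exp(-lam * t), t * t, t, 1.0] for t in times]
    p = 4
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return tuple(solve_linear_system(xtx, xty))

# Noise-free check: build y from the Tone 5 coefficients in Table 6.16
# and confirm that the regression recovers them when lambda is known.
times = list(range(1, 21))
beta, lam, d, a, b = 4.27, 0.19, -0.0042, 0.17, -2.13
y = [beta * math.exp(-lam * t) + d * t * t + a * t + b for t in times]
beta0, d0, a0, b0 = initial_values_sim2(times, y, lam)
print(round(beta0, 4), round(d0, 4), round(a0, 4), round(b0, 4))
```

With real data the recovered values would only be starting points for the
non-linear regression, since λ itself is re-estimated there.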

After fitting these models, a comparison procedure was conducted to
choose from the four models. I chose the model with the lowest AIC
(Akaike’s information criterion), a criterion for choosing the model that fits
best (Kim and Timm 2006).
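
The selection rule can be sketched as follows, assuming a least-squares fit
with Gaussian errors, for which the AIC can be computed from the residual
sum of squares. The residual sums of squares below are hypothetical values
chosen for illustration only; they are not results from this study.

```python
import math

def gaussian_aic(rss, n_obs, n_params):
    """AIC of a least-squares fit with Gaussian errors.

    n_params counts the regression coefficients; the error variance adds one.
    """
    log_likelihood = -0.5 * n_obs * (
        math.log(2 * math.pi) + math.log(rss / n_obs) + 1
    )
    return 2 * (n_params + 1) - 2 * log_likelihood

def select_model(fits, n_obs):
    """fits maps a model name to (residual sum of squares, number of coefficients)."""
    scores = {name: gaussian_aic(rss, n_obs, k) for name, (rss, k) in fits.items()}
    return min(scores, key=scores.get), scores

# Hypothetical residual sums of squares for one tone (illustration only).
hypothetical_fits = {
    "sim_1": (0.50, 4),   # beta, lambda, a, b
    "com_1": (0.48, 6),   # c1, c2, c3, lambda, a, b
    "sim_2": (0.20, 5),   # beta, lambda, d, a, b
    "com_2": (0.19, 7),   # c1, c2, c3, lambda, d, a, b
}
best, scores = select_model(hypothetical_fits, n_obs=20)
print(best)
```

The penalty term means that a more complex model is preferred only when it
reduces the residual error enough to offset its extra coefficients.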
Notice from Table 6.16 that the best model chosen for Tones 3, 4, 5, and 6 has
an underlying tonal target of polynomial degree two, represented by the
coefficient d, whereas the rest of the tones have a linear underlying tonal
target. Plugging in the coefficients in Table 6.16 for each model of tones, we
can plot the fitted tonal contours and the mean tone contours from the original
data for comparison. In the plots in Figures 6.9 to 6.16, the dotted lines are
the mean values from the original data, and the solid lines are the fitted
values from the model. It can be seen that these two contours are very close to
each other, and that the contours plotted from the fitted models are smooth.
Based on the chosen models for each tone, the fitted values obtained from
the model were calculated using time points 1~20, and four sample quantiles
(20 percent, 40 percent, 60 percent, and 80 percent) were found based on all
the fitted F0 values; the corresponding cut-off values and the transformation
to Chao’s five points are listed in Table 6.17. In order to alleviate
perturbation effects from the onset consonants, we use the average of the
fitted F0 values at the first and second time points as the onset value,
instead of the single fitted F0 value at the first time point, and transform
it according to the quantile cut-offs. This method sorts F0 values into five
categories according to the quantiles.
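
As a worked illustration for Tone 1, the sketch below evaluates the “sim_1”
model with the coefficients in Table 6.16, averages the first two time
points for the onset, and maps the values to Chao’s scale with the cut-offs
in Table 6.17. Taking the offset at time point 20 is an assumption of this
example, and the function names are hypothetical. Values that fall very
close to a cut-off may map differently, because the printed cut-offs and
coefficients are rounded.

```python
import math
from bisect import bisect_right

CUTOFFS = [-0.58, -0.28, -0.08, 0.45]   # 20/40/60/80% quantiles, Table 6.17

def chao_digit(value):
    """Map a fitted log-normalized F0 value to Chao's scale (1 to 5)."""
    return bisect_right(CUTOFFS, value) + 1

def sim_1(t, beta, lam, a, b):
    """Model sim_1: exponential approach to a linear target."""
    return beta * math.exp(-lam * t) + a * t + b

# Tone 1 coefficients from Table 6.16.
tone1 = dict(beta=-0.21, lam=1.39, a=-0.12, b=1.21)
onset = (sim_1(1, **tone1) + sim_1(2, **tone1)) / 2   # mean of time points 1 and 2
offset = sim_1(20, **tone1)                           # assumed: last time point
letters = f"{chao_digit(onset)}{chao_digit(offset)}"
print(round(onset, 3), round(offset, 3), letters)
```

The onset and offset agree with the Tone 1 row of Table 6.18 (0.997 and
-1.190), and the resulting tone letters are 51.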

Figures 6.9 to 6.16 each compare the fitted contour of one tone with its mean
contour (x-axis: Time Points; y-axis: Log Normalized Frequency; legend:
Fitted Tone, Mean).

Figure 6.9 Tone 1
Figure 6.10 Tone 2
Figure 6.11 Tone 3
Figure 6.12 Tone 4
Figure 6.13 Tone 5
Figure 6.14 Tone 6
Figure 6.15 Tone 7
Figure 6.16 Tone 8

Based on the criteria in Table 6.17 and the fitted values for the onset and
offset of tones, the transformation can be done accordingly. Since Tones 3, 4,
5, and 6 have underlying targets of a polynomial degree of two, the turning
points of these tones were also transformed. The turning points are found by
setting the derivative of the function in the chosen model for each tone to
zero and solving the resulting non-linear equation using the R package
“nleqslv”. After the positions of the turning points are calculated, the values
are plugged into the original model to obtain the fitted values for the turning
point.
For example, in calculating the position of the turning point for Tone 4,
the derivative of the function y(t) is calculated and set to zero to solve the
non-linear equation for t, which stands for the position of the turning point:

$$y(t) = (-9.24 - 1.37t + 0.12t^2)\, e^{-0.18t} + 0.02t^2 - 1.07t + 10.63$$

The solution is approximately t = 3.66, which is then plugged into the function
y(t) for the fitted turning point F0 value, and a further transformation is done
based on the criteria in Table 6.17. The final phonological representation is
shown in Table 6.18.
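
The Tone 4 calculation can be sketched in Python. This example swaps in a
plain bisection search for the R package used in the study, takes the
exponent as -λt with λ = 0.18 as in the “com_2” model and Table 6.16, and
assumes the search bracket [3, 4.5], which was chosen by inspecting the sign
of the derivative. The function names are hypothetical.

```python
import math

# Tone 4 (com_2) coefficients from Table 6.16.
C1, C2, C3, LAM, D, A, B = -9.24, -1.37, 0.12, 0.18, 0.02, -1.07, 10.63

def y(t):
    """Fitted contour of Tone 4."""
    return (C1 + C2 * t + C3 * t * t) * math.exp(-LAM * t) + D * t * t + A * t + B

def dy(t):
    """Analytic first derivative of y(t)."""
    poly = C1 + C2 * t + C3 * t * t
    return (C2 + 2 * C3 * t - LAM * poly) * math.exp(-LAM * t) + 2 * D * t + A

def bisect_root(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi]; f(lo) and f(hi) must differ in sign."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("root is not bracketed")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return (lo + hi) / 2

# The bracket [3, 4.5] is an assumption chosen by inspecting dy(t).
turning_t = bisect_root(dy, 3.0, 4.5)
print(round(turning_t, 2), round(y(turning_t), 3))
```

The root lies close to t = 3.66, and the fitted value there is close to the
turning point reported for Tone 4 in Table 6.18; small differences stem from
the rounded coefficients.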
In Table 6.18, the tone values in parentheses are those shown in Table 6.1.
Compared with the values in Table 6.1 from impressionistic data, the
basic contours are characterized similarly, but specific values (one to five)
are assigned with some differences. Also, the tones whose underlying targets
tested as quadratic (Tones 3, 4, 5, and 6) are described using three Chao’s
letters in the fieldwork description as well, except for Tone 5, which shows
consistency in the usage of the turning point between statistical modeling of
instrumental data and pure impressionistic data. Specifically, Tone 1 shows a
steeper falling slope in my representation (51), as plotted in Figure 6.9, in
contrast to the flatter slope of the previous representation (53). Tone 2 has
a similar contour to previous representations with higher offset values.
Tone 3 has a higher onset and lower turning point and offset. Tone 4, Tone 5,
and Tone 6 have a higher onset, and Tone 7 has a steeper falling slope.
Finally, Tone 8 shows a higher onset and offset, but still a rising slope as in
the fieldwork description. The transformed values in this study conform well to
the figures of the tonal contours presented.

Table 6.16 Models chosen for each tone and estimated coefficients

Tone   Model   c1       c2      c3      β        λ        d         a        b
1      sim_1   NA       NA      NA      -0.21    1.39     NA        -0.12    1.21
2      com_1   -12.73   -2.31   -0.11   NA       0.16     NA        -0.41    13.19
3      sim_2   NA       NA      NA      16.63    0.096    -0.02     0.95     -14.30
4      com_2   -9.24    -1.37   0.12    NA       0.18     0.02      -1.07    10.63
5      sim_2   NA       NA      NA      4.27     0.19     -0.0042   0.17     -2.13
6      sim_2   NA       NA      NA      -62.97   -0.042   0.10      2.03     64.80
7      sim_1   NA       NA      NA      2.32     0.58     NA        -0.008   -0.062
8      sim_1   NA       NA      NA      3.12     0.56     NA        0.09     -1.16

Table 6.17 Quantiles based on fitted models and transformation to tone letters

Quantile                  20%     40%     60%     80%
Value                     -0.58   -0.28   -0.08   0.45

Chao’s Five-Point Scale   1         2                3                4               5
Fitted value              < -0.58   [-0.58, -0.28)   [-0.28, -0.08)   [-0.08, 0.45)   >= 0.45

6.5 Discussions and conclusions

As noted in the introduction section, phonetics and phonology are usually
deemed two distinct subfields of linguistics. The integration of the
two is possible and has been advocated, and some unified approaches attempt
to treat the two subfields as one. This study adopts a
modular approach and recognizes that the two subfields are distinct, but it
also stresses that phonological representations can be offered based on
statistical modeling of instrumental data in addition to descriptions based on
impressionistic data. It is hoped that this method can be refined and adapted
to new sets of data in order to assist in finding a plausible tonal
representation for new phonetic data.

Table 6.18 Phonological representations based on acoustic values

Tone   Onset    Turning Point   Offset   Chao’s Five-point Scale
1      0.997    NA              -1.190   51 (53)
2      -0.339   NA              0.794    25 (24)
3      1.511    -0.58           -0.861   511 (435)
4      0.682    0.438           -2.460   541 (241)
5      1.230    -0.279          -1.195   521 (33)
6      0.915    -0.738          -0.462   512 (213)
7      0.939    NA              -0.222   52 (55)
8      0.375    NA              0.640    45 (23)

This study first focuses on phonetic examinations of several phon-
etic cues subject to a perceptual experiment, and it proceeds to statistically
model Chongming Chinese tones to provide a phonological representa-
tion. Before illustrating how phonetic data is transformed into phonological
representations, a perceptual study is conducted to test whether other phon-
etic cues reported in production, such as duration and glottalization, con-
tribute to discrimination of allotone pairs after voiced versus voiceless onset
in general and for specific pairs, and whether F0 contours suffice to discrim-
inate allotone pairs without including onset consonants. The perceptual
experiment used a fractional factorial design to evaluate whether accuracy
rate and reaction time are affected by three variables: duration (normalized
or original), glottalization (with or without), and onset consonants (with or
without). The results show no statistical significance obtained with respect to
accuracy rate and reaction time for the allotones tested in general. Moreover,
although glottalization and short duration are reported in the fieldwork
records, and confirmed by phonetic examination in this study to accompany
Allotone 7 and Allotone 8 specifically, they do not seem to contribute
significantly to perceptually discriminating these allotones when F0
contours are available.
Based on the results of the perceptual experiment, F0 contours do play
an important role in the discrimination of the allotone pairs with respect
to voiced versus voiceless onset, and they may be the primary cue, sub-
ject to further experiments. This study then proceeds to statistically model
F0 values extracted, in order to obtain a phonological representation for
monotones. Based on previous research, four models were fitted to calcu-
late each underlying pitch target, and the optimal model with minimized
AIC was chosen. Whether the underlying target is quadratic or not can also
be statistically tested, and if the underlying target is quadratic, a turning

point is found mathematically and transformed based on the chosen model.
If the underlying target is not quadratic, the turning point is not included
in the phonological representation. Then, an adjusted onset point, which
alleviates perturbation effects, and an offset point are calculated for
further transformation. The plots of fitted values generated from the
selected model showed contours similar to the plots of averaged F0 values.
The fitted values from the optimal model were calculated, and four sample
quantiles (20 percent, 40 percent, 60 percent, and 80 percent) were computed
from all the fitted F0 values. These quantiles are needed to transform the
adjusted onset, the turning point, and the offset to Chao’s five points by
comparing each value with the quantiles. The tones that are statistically tested to be
quadratic correspond well to the fieldwork description, and the basic tonal
shapes are similar, with some differences in the exact integers for the onset,
turning point, and offset.
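The procedure summarized above can be sketched in Python. This is an illustrative sketch only and not the R code used in the study: the linear and quadratic polynomials stand in for the four candidate models (which are not specified in this section), AIC selection stands in for the statistical test of quadraticity, the first fitted value stands in for the adjusted onset, and the synthetic contours and all function names are assumptions.

```python
import numpy as np

def aic(y, fitted, n_params):
    """AIC of a Gaussian least-squares fit, up to an additive constant."""
    n = len(y)
    rss = float(np.sum((y - fitted) ** 2))
    return n * np.log(rss / n) + 2 * n_params

def select_model(t, f0, degrees=(1, 2)):
    """Fit each candidate polynomial and keep the one with minimal AIC."""
    candidates = []
    for d in degrees:
        coefs = np.polyfit(t, f0, d)
        fitted = np.polyval(coefs, t)
        candidates.append({"degree": d, "coefs": coefs, "fitted": fitted,
                           "aic": aic(f0, fitted, d + 1)})
    return min(candidates, key=lambda m: m["aic"])

def key_points(model, t):
    """Onset, optional turning point, and offset of the fitted contour."""
    onset, offset = model["fitted"][0], model["fitted"][-1]
    if model["degree"] == 2:
        a, b, _ = model["coefs"]
        t_turn = -b / (2 * a)  # vertex of the fitted parabola
        if t[0] < t_turn < t[-1]:
            return [onset, float(np.polyval(model["coefs"], t_turn)), offset]
    return [onset, offset]

def chao_level(value, cuts):
    """Map a value to 1..5 by counting the quantile cut points it reaches."""
    return int(np.searchsorted(cuts, value, side="right")) + 1

# Synthetic F0 contours (Hz) over normalized time; assumed data, not the study's.
t = np.linspace(0.0, 1.0, 20)
wiggle = np.where(np.arange(20) % 2 == 0, 1.0, -1.0)  # small deterministic jitter
tones = {
    "falling": 200 - 80 * t + wiggle,
    "dipping": 120 + 160 * (t - 0.5) ** 2 + wiggle,
}

models = {name: select_model(t, f0) for name, f0 in tones.items()}
pooled = np.concatenate([m["fitted"] for m in models.values()])
cuts = np.quantile(pooled, [0.2, 0.4, 0.6, 0.8])
levels = {name: [chao_level(v, cuts) for v in key_points(m, t)]
          for name, m in models.items()}
print(levels)
```

Pooling the fitted values of all tones before taking the quantiles means that each tone is placed on a scale shared across the whole inventory, which is what allows the resulting integers to be compared across tones.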
For future studies, more perceptual experiments need to be designed,
testing allotone pairs with a difference in aspiration of onsets, and the iden-
tification tasks may also be used to detect whether other cues may contribute
more in identification tasks than discrimination tasks. More evaluations
need to be done on the effectiveness of Chao’s model, and the transform-
ation methods can be further developed according to the improved model.
The effectiveness of the transformation methods can also be evaluated by
perceptual experiments. The proposed method in this study is still prelim-
inary, and more data on tonal languages need to be collected to compare
the current methods with methods proposed in the literature for further

Acknowledgments

The help of language consultants in collecting Chongming Chinese data is
highly appreciated. Comments and suggestions concerning statistical mod-
eling and R code checking from professors in the Department of Statistics,
University of Florida, are gratefully acknowledged. This work is supported
by grant [1-​ZVHH] from the Faculty of Humanities and grant [G-​UAAG]
from the Department of Chinese and Bilingual Studies at the Hong Kong
Polytechnic University, and partly supported by Early Career Scheme [No.
T26023416] from the Research Grants Council of Hong Kong.


Tone Duration Glottalization Onset consonant Accuracy rate Reaction time

1a1a + -​ -​ 0.88 1233.88

1b1b -​ + -​ 0.94 1008.19


1c1c -​ -​ + 0.94 1085.19

1abc1abc + + + 0.94 980.88
1a2a + -​ -​ 0.88 1468.44
1b2b -​ + -​ 0.88 1315.94
1c2c -​ -​ + 0.88 1365.13
1abc2abc + + + 1.00 1207.19
4a4a + -​ -​ 0.96 1093.22
4b4b -​ + -​ 0.88 1100.15
4c4c -​ -​ + 0.94 1089.82
4abc4abc + + + 0.91 1050.10
1a4a + -​ -​ 0.82 1424.94
1b4b -​ + -​ 0.82 1488.69
1c4c -​ -​ + 0.82 1670.50
1abc4abc + + + 0.69 1625.44
6a6a + -​ -​ 0.92 1118.03
6b6b -​ + -​ 0.86 1126.42
6c6c -​ -​ + 0.88 1048.28
6abc6abc + + + 0.88 1158.15
1a6a + -​ -​ 0.75 1497.75
1b6b -​ + -​ 0.75 1281.32
1c6c -​ -​ + 0.88 1519.57
1abc6abc + + + 1.00 1390.44
8a8a + -​ -​ 0.91 949.25
8b8b -​ + -​ 0.91 1156.35
8c8c -​ -​ + 0.91 1051.13
8abc8abc + + + 0.91 1082.19
1a8a + -​ -​ 0.82 1225.07
1b8b -​ + -​ 0.94 1921.25
1c8c -​ -​ + 0.94 1607.57
1abc8abc + + + 0.88 1550.88
3a3a + -​ -​ 0.88 1183.13
3b3b -​ + -​ 0.94 1126.25
3c3c -​ -​ + 0.94 1450.94
3abc3abc + + + 0.94 981.63
2a3a + -​ -​ 0.69 1800.88
2b3b -​ + -​ 0.79 1372.50
2c3c -​ -​ + 0.88 1378.88
2abc3abc + + + 0.75 1577.19
5a5a + -​ -​ 0.92 1031.05
5b5b -​ + -​ 0.92 1158.19
5c5c -​ -​ + 0.90 1020.15
5abc5abc + + + 0.92 1091.63


2a5a + -​ -​ 0.82 1436.63

2b5b -​ + -​ 0.82 1522.88
2c5c -​ -​ + 0.75 1567.63
2abc5abc + + + 0.69 2818.69
7a7a + -​ -​ 0.91 1060.10
7b7b -​ + -​ 0.94 1074.44
7c7c -​ -​ + 0.94 1153.35
7abc7abc + + + 0.88 1059.29
2a7a + -​ -​ 0.88 1332.94
2b7b -​ + -​ 0.82 1504.50
2c7c -​ -​ + 0.94 1613.82
2abc7abc + + + 0.94 1429.82
3a4a + -​ -​ 0.75 1424.38
3b4b -​ + -​ 0.75 1231.50
3c4c -​ -​ + 0.75 1255.00
3abc4abc + + + 0.88 1456.82
3a6a + -​ -​ 0.75 1768.25
3b6b -​ + -​ 0.82 1658.00
3c6c -​ -​ + 0.82 1884.00
3abc6abc + + + 0.63 1335.32
3a8a + -​ -​ 0.88 1290.75
3b8b -​ + -​ 0.69 1304.13
3c8c -​ -​ + 0.88 1670.38
3abc8abc + + + 0.94 1331.57
4a5a + -​ -​ 0.75 1453.69
4b5b -​ + -​ 0.82 1434.88
4c5c -​ -​ + 0.57 1491.94
4abc5abc + + + 0.82 1455.63
4a7a + -​ -​ 0.88 1752.75
4b7b -​ + -​ 0.88 1609.94
4c7c -​ -​ + 0.69 1641.57
4abc7abc + + + 0.94 1381.44
5a6a + -​ -​ 0.63 1551.50
5b6b -​ + -​ 0.75 1290.50
5c6c -​ -​ + 0.75 1178.88
5abc6abc + + + 0.82 1147.63
5a8a + -​ -​ 0.82 1684.75
5b8b -​ + -​ 0.75 1719.82
5c8c -​ -​ + 0.75 1400.69
5abc8abc + + + 0.69 1567.57
6a7a + -​ -​ 0.82 1350.57
6b7b -​ + -​ 0.69 1628.00


6c7c -​ -​ + 0.88 1340.00

6abc7abc + + + 0.88 1350.19
7a8a + -​ -​ 0.82 1281.75
7b8b -​ + -​ 0.82 1585.69
7c8c -​ -​ + 0.75 1354.75
7abc8abc + + + 0.69 1616.88
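The four cue combinations coded a, b, c, and abc in the table above are consistent with a half fraction of a two-level, three-factor design. The following Python sketch generates those runs under the assumption that the defining relation is I = ABC; this relation is inferred from the tabulated signs rather than stated in the text, and the factor and function names are hypothetical.

```python
from itertools import product

FACTORS = ("duration", "glottalization", "onset_consonant")

def half_fraction(n_factors=3):
    """Runs of a 2^(n-1) half fraction: sign combinations whose product is +1."""
    runs = []
    for signs in product((+1, -1), repeat=n_factors):
        sign_product = 1
        for s in signs:
            sign_product *= s
        if sign_product == +1:
            runs.append(signs)
    return runs

def label(signs):
    """Label a run by the letters of the factors set to '+', e.g. 'abc'."""
    return "".join(letter for letter, s in zip("abc", signs) if s == +1)

runs = half_fraction()
design = {label(r): dict(zip(FACTORS, r)) for r in runs}
for name in sorted(design):
    print(name, design[name])
```

A half fraction needs four stimulus versions per tone pair instead of eight, at the cost of confounding each main effect with a two-factor interaction.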

References

’t Hart, J., Collier, R., and Cohen, A. (1990) A perceptual study of intonation: An
experimental phonetic approach to speech melody. Cambridge: Cambridge
University Press.
Abramson, A. S. (1972) “Tonal experiments with whispered Thai” in Valdman, A.
(ed.) Papers on linguistics and phonetics to the memory of Pierre Delattre. The
Hague: Mouton, pp. 29–​55.
Abramson, A. S. (1975) “The tones of Central Thai: Some perceptual experiments”
in Harris, J. G., and Chamberlain, J. R. (eds.) Studies in Thai linguistics in honor of
William J. Gedney. Bangkok: Central Institute of English Language, pp. 1–16.
Andruski, J. E. (2006) “Tone clarity in mixed pitch/​phonation-​type tones”, Journal of
Phonetics, 34(3), pp. 388–404.
Andruski, J. E., and Costello, J. (2004) “Using polynomial equations to model pitch
contour shape in lexical tones: An example from Green Mong”, Journal of the
International Phonetic Association, 34(2), pp. 125–​140.
Bao, Z.-M. (1990) On the nature of tone. Ph.D. Diss., Massachusetts Institute of
Technology.
Belotel-​Grenie, A., and Grenie, M. (1997) “Phonation and tone types in Standard
Chinese”, Cahiers de Linguistique Asie Orientale, 26(2), pp. 249–​279.
Blicher, D. L., Diehl, R. L., and Cohen, L. B. (1990) “Effects of syllable duration on
the perception of the Mandarin Tone 2/​Tone 3 distinction: Evidence of auditory
enhancement”, Journal of Phonetics, 18(1), pp. 37–​49.
Boersma, P., and Weenink, D. (2013) Praat: Doing phonetics by computer [Computer
program]. Version 5.3.52. Retrieved 12 June 2013,​.
Brunelle, M. (2003) “Tone coarticulation in Northern Vietnamese”, in Solé, M. J.,
Recasens, D., and Romero, J. (eds.) Proceedings of the 15th International Congress
of Phonetic Sciences. Barcelona: ICPhS Archive, pp. 2673–​2676.
Brunelle, M. (2009) “Northern and Southern Vietnamese tone coarticulation: A
comparative case study”, Journal of the Southeast Asian Linguistics Society, 1,
pp. 49–​62.
Burnham, D., and Francis, E. (1997) “The role of linguistic experience in the percep-
tion of Thai tones” in Abramson, A. S. (ed.) South East Asian linguistic studies
in honour of Vichin Panupong. Science of Language 8. Bangkok: Chulalongkorn
University Press, pp. 29–​47.

Chang, L. M. (1992) A prosodic account of tone, stress and tone sandhi in Chinese
languages. Ph.D. Diss., University of Hawaii.
Chao, Y. R. (1930) “A system of tone letters”, Le Maitre Phonetique, 45, pp. 24–​27.
Charpentier, F., and Stella, M. (1986) “Diphone synthesis using an overlap-​add tech-
nique for speech waveforms concatenation”, Proceedings of ICASSP, 86(3), pp.
Chen, M. (2000) Tone sandhi patterns across Chinese dialects. Cambridge: Cambridge
University Press.
Chen, M., and Zhang, H.-M. (1997) “Lexical and postlexical tone sandhi in
Chongming” in Wang, J.-​L., and Smith, N. (eds.) Studies in Chinese phonology.
Berlin and New York: Mouton de Gruyter, pp. 13–​52.
Chen, S., Zhang, C., McCollum, A. G., and Wayland, R. (2017) “Statistical modelling of
phonetic and phonologised perturbation effects in tonal and non-​tonal languages”,
Speech Communication, 88, pp. 17–​38. doi: 10.1016/​j.specom.2017.01.006
Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York: Harper
and Row.
Clements, G., and Ford, K. C. (1979) “Kikuyu tone shift and its synchronic
consequences”, Linguistic Inquiry, 10(2), pp. 179–210.
Cohn, A. C. (1990) Phonetic and phonological rules of nasalization. Ph.D. Diss.,
University of California Los Angeles. Distributed as UCLA Working Papers in
Phonetics 76.
Cohn, A. C. (2007) “Phonetics in phonology and phonology in phonetics”, Working
Papers of the Cornell Phonetics Lab, 16, pp. 1–​31.
Davison, D. S. (1991) “An acoustic study of so-called creaky voice in Tianjin
Mandarin”, University of California Working Papers in Phonetics, 78, pp. 50–​57.
Duanmu, S. (1990) A formal study of syllable, tone, stress and domain in Chinese
languages. Ph.D. Diss., Massachusetts Institute of Technology.
Duanmu, S. (1994) “Against contour tone units”, Linguistic Inquiry, 25(4), pp.
Flemming, E. (2001) “Scalar and categorical phenomena in a unified model of
phonetics and phonology”, Phonology, 18, pp. 7–​44.
Fujisaki, H. (2004) “Prosody, information, and modeling: With emphasis of tonal
features of speech”, in Bel, B., and Marlien, I. (eds.) Speech prosody. Nara: ISCA
Archive, pp. 1–​10.
Fujisaki, H., and Kawashima, T. (1969) “On the modes and mechanisms of speech
perception”, Annual Report of the Engineering Research Institute, 28, pp. 67–​73.
Fujisaki, H., Wang, C., Ohno, S., and Gu, W.-​T. (2005) “Analysis and synthesis of fun-
damental frequency contours of Standard Chinese using the command-​response
model”, Speech Communication, 47, pp. 59–​70.
Gandour, J. (1978) “The perception of tone” in Fromkin, V. (ed.) Tone: A linguistic
survey. New York: Academic, pp. 26–​37.
Gerrits, E., and Schouten, M. E. H. (2004) “Categorical perception depends on the
discrimination task”, Perception & Psychophysics, 66(3), pp. 363–376.
Goldsmith, J. A. (1976) Autosegmental phonology. Ph.D. Diss., Massachusetts
Institute of Technology.
Goldsmith, J. A. (1999) Phonological theory: The essential readings. Oxford: Blackwell.
Hombert, J. M. (1976) “Perception of tones of bisyllabic nouns in Yoruba”, Studies in
African Linguistics, Supplement 6, pp. 109–​121.


Hombert, J. M. (1977) “Development of tones from vowel height?” Journal of
Phonetics, 5, pp. 9–​16.
House, A. S., and Fairbanks, G. (1953) “The influence of consonant environment
upon secondary acoustical characteristics of vowels”, Journal of the Acoustical
Society of America, 25(1), pp. 105–​113.
Hyman, L. M. (2013) “Enlarging the scope of phonologization” in Yu A. (ed.) Origins
of sound change: Approaches to phonologization. Oxford: Oxford University Press,
pp. 3–​28.
International Phonetic Association. (1989) “Report on the 1989 Kiel Convention”,
Journal of the International Phonetic Association, 19(2), pp. 67–​80.
Jangjamras, J. (2012) Perception and production of English lexical stress by Thai
speakers. Ph.D. Diss., University of Florida.
Keating, P. A. (1996) “The phonology-​phonetics interface” in Kleinhenz, U. (ed.),
Interfaces in phonology. Berlin: Akademie Verlag, pp. 262–278.
Keyser, S. J., and Stevens, K. N. (2001) “Enhancement revisited” in Kenstowicz, M.,
and Hale, K. (eds.) A life in language. Cambridge, MA: MIT Press, pp. 271–​291.
Kim, K., and Timm, N. (2006) Univariate and multivariate general linear models: Theory
and applications with SAS. Boca Raton, FL: CRC Press.
Kingston, J. (2007) “The phonetics-​phonology interface” in de Lacy, P. (ed.) The
Cambridge handbook of phonology. Cambridge: Cambridge University Press, pp.
Kong, Q. M. (1987) “Influence of tones upon vowel duration in Cantonese”, Language
and Speech, 30(4), pp. 387–​399.
Kuehl, R. (2000) Design of experiments: Statistical principles of research design and
analysis. Belmont, CA: Brooks/​Cole, Cengage Learning.
Lemmetty, S. (1999) Review of speech synthesis technology. MS Thesis, Helsinki
University of Technology.
Lin, H. S. (2011) “Sequential and tonal markedness in Dongshi Hakka tone sandhi”,
Language and Linguistics, 12(2), pp. 313–​357.
Local, J., Ogden, R., and Temple, R. (2003) Phonetic interpretation: Papers in labora-
tory phonology VI. Cambridge: Cambridge University Press.
McCarthy, J. (1996) Faithfulness and prosodic circumscription. MS Thesis, University
of Massachusetts, Amherst.
McCarthy, J. (2002) A thematic guide to optimality theory. Cambridge: Cambridge
University Press.
McCarthy, J. (2007) “What is optimality theory?” Language and Linguistics Compass,
1(4), pp. 260–​291.
McCarthy, J., and Prince, A. (1993) Prosodic morphology I: Constraint interaction
and satisfaction. MS Thesis, University of Massachusetts, Amherst, and Rutgers
University, New Brunswick, NJ.
McCarthy, J., and Prince, A. (1994) “The emergence of the unmarked: Optimality
in prosodic morphology” in Gonzàlez, M. (ed.) Proceedings of the North-East
Linguistics Society 24. Amherst, MA: Graduate Linguistics Students Association,
pp. 333–​379.
McGuire, G. (2010) A brief primer on experimental designs for speech perception
research. Laboratory Report 77.
Nolan, F. (2003) “Intonational equivalence: An experimental evaluation of pitch
scales” in Solé, M. J., Recasens, D., and Romero, J. (eds.) Proceedings of the

15th International Congress of Phonetic Sciences. Barcelona: ICPhS Archive, pp.
Paterson, H. J. (2015) “Phonetic transcription of tone in the IPA” in The Scottish
Consortium for ICPhS 2015 (ed.) Proceedings of the 18th International Congress of
Phonetic Sciences. Glasgow, UK: University of Glasgow, pp. 507.1–​5.
Pham, A. H. (2003) Vietnamese tone: A new analysis. Outstanding Studies in
Linguistics. London: Routledge.
Pierrehumbert, J. (1980) The phonology and phonetics of English intonation. Ph.D.
Diss., Massachusetts Institute of Technology.
Prince, A., and Smolensky, P. (1993) Optimality theory: Constraint interaction in
generative grammar. MS Thesis, Rutgers University, New Brunswick, NJ, and
University of Colorado, Boulder.
Prom-​On, S., Xu, Y., and Thipakorn, B. (2009) “Modeling tone and intonation in
Mandarin and English as a process of target approximation”, Journal of the
Acoustical Society of America, 125(1), pp. 405–​424.
Rose, P. (2014) “Transcribing tone – A likelihood-based quantitative evaluation of
Chao’s ‘tone letters’” in Li, H., Meng, H. M., Ma, B., Chng, E., and Xie, L. (eds.)
15th Annual Conference of the International Speech Communication Association
(INTERSPEECH’14). Singapore: ISCA Archive, pp. 101–105.
Shi, F. (1990) Yuyinxue Tanwei [An exploration of phonetics]. Peking: Peking
University Press.
Silverman, K., Beckman, M., Pitrelli, J., Ostendorf, M., Wightman, C., Price, P.,
Pierrehumbert, J., and Hirschberg, J. (1992) “ToBI: A standard for labeling English
prosody” in Proceedings of ICSLP, Banff, pp. 867–​870.
Snodgrass, J., Levy-​Berger, G., and Haydon, M. (1985) Human experimental psych-
ology. New York: Oxford University Press.
Steriade, D. (2000) “Paradigm uniformity and the phonetics phonology boundary” in
Broe, M., and Pierrehumbert, J. (eds.) Papers in laboratory phonology V: Acquisition
and the lexicon. Cambridge: Cambridge University Press, pp. 313–​334.
Sun, X.-​J. (2001) “Predicting underlying pitch targets for intonation modeling” in
4th ISCA Tutorial and Research Workshop on Speech Synthesis. Perthshire: ISCA
archive, pp. 143–​148.
Valbret, H., Moulines, E., and Tubach, J. (1991) “Voice transformation using PSOLA
technique”, Proceedings of Eurospeech, 91(1), pp. 345–​348.
Wayland, R., and Guion, S. (2003) “Perceptual discrimination of Thai tones by native
and experienced learners of Thai”, Applied Psycholinguistics, 24(1), pp. 113–​129.
Werker, J. F., and Logan, J. (1985) “Cross-​language evidence for three factors in speech
perception”, Perception & Psychophysics, 37(1), pp. 35–44.
Whalen, D. H., and Xu, Y. (1992) “Information for Mandarin tones in the amplitude
contour and in brief segments”, Phonetica, 49(1), pp. 25–​47.
Wong, Y. W. (2006) “Contextual tonal variations and pitch targets in Cantonese” in
Proceedings of speech prosody. Dresden, pp. 317–​320.
Xu, Y. (2005) “Speech melody as articulatorily implemented communicative
functions”, Speech Communication, 46, pp. 220–​251.
Xu, Y., and Prom-​on, S. (2014) “Toward invariant functional representations of vari-
able surface fundamental frequency contours: Synthesizing speech melody via
model-​based stochastic learning”, Speech Communication, 57, pp. 181–​208.
Xu, Y., and Wang, Q. E. (2001) “Pitch targets and their realization: Evidence from
Mandarin Chinese”, Speech Communication, 33, pp. 319–​337.


Yip, M. (1990) “The Phonology of Cantonese Loanwords: Evidence for Unmarked
Settings for Prosodic Parameters.” Paper presented at the Second Northeastern
Conference on Chinese Linguistics, University of Pennsylvania.
Yip, M. (2002) Tone. Cambridge: Cambridge University Press.
Zhang, H.-​Y. (2009) Chongming Fangyan Yanjiu [A study of Chongming Chinese].
Beijing: China Social Sciences Press.
Zhang, J. (2001) The effects of duration and sonority on contour tone distribution –​
typological survey and formal analysis. Ph.D. Diss., University of California Los
Angeles.
Zhang, Y., Nissen, S., and Francis, A. (2008) “Acoustic characteristics of English
lexical stress produced by native Mandarin speakers”, Journal of the Acoustical
Society of America, 123(6), pp. 4498–​4513.
Zhu, X. N., Shi, D. F., and Wei, M. Y. (2012) “Six level tones in Yuliang Miao and the
multiregister and four-​level tonal model”, Minzu Yuwen [Minority languages of
China], 4, pp. 1–​10.

Prosodic encoding of contrastive
focus in Shanghai Chinese
Bijun Ling and Jie Liang

7.1 Introduction
It is well known that in speech communication, the same sentence is often
uttered differently depending on the communicative context and the speaker’s
intention. There are at least three ways to package an utterance in order to
integrate it into the information flow of ongoing discourse:

(1) using word order (i.e., given information generally precedes focused
information) (e.g., Birner 1994; Clark and Clark 1978); (2) using particular
lexical items and syntactic constructions (e.g., using cleft constructions
such as ‘It was Damon who fried an omelet’) (Lambrecht 2001); and
(3) using prosody. Prosody is comprised of acoustic features like fun-
damental frequency (f0), duration, and loudness, the combinations of
which give rise to psychological percepts like phrasing (grouping), stress
(prominence), and tonal movement (intonation).
(Breen et al. 2010: 1049)

There have been many studies on the prosodic encoding of different information
structure notions, especially focus, in many languages and from different
perspectives.

7.1.1 Focus encoding

Generally speaking, focused elements are more acoustically prominent than
given elements. Some features that have been proposed to be associated with
prominence include pitch (i.e., f0) (Lieberman 1960; Cooper, Eady, and Cooper
1986), duration (Beckman 1986), loudness (i.e., intensity) (Beckman 1986;
Turk and Sawusch 1996; Kochanski et al. 2005), and voice quality (Sluijter
and van Heuven 1996). Focus is normally realized as increased intensity,
lengthened duration, a focal f0 rise or f0 range expansion, and post-focal
reduction (for Germanic languages, Pierrehumbert 1980; Cooper, Eady, and
Mueller 1985; Eady et al. 1986; Bartels and Kingston 1994; Baumann, Grice,
and Steindamm 2006; Cruttenden 2006; Ishihara and Fery 2006; Fery and
Kugler 2008; for Sinitic languages, Garding 1987; Shih 1988; Xu 1999; Chen


2005, 2010; Chen and Gussenhoven 2008, among others). Among the three
acoustic cues (f0, duration, intensity), most scholars have argued that f0 is a
highly important acoustic feature (Lieberman 1960; Rietveld and Gussenhoven
1985; Gussenhoven et al. 1997; Terken 1991); however, more recently some
researchers have argued that intensity (and duration) are better predictors of
perceived prominence than pitch (Beckman 1986; Turk and Sawusch 1996;
Kochanski et al. 2005).
Furthermore, the general pattern of focal f0-​rising/​expansion and post-​
focal f0-​lowering/​compression has been hypothesized in different languages
as being due to different linguistic representations (Chen 2010). For example,
in English, where f0 does not typically differentiate lexical meanings, this leads
to the structural effect of focus, that is, the association of nuclear pitch accent
with the focused constituent and the absence of pitch accents on post-​focal
constituents (Ladd 1980, 1996; Gussenhoven 1983, 1984; Selkirk 1984, 1995).
In languages such as Standard Chinese, where f0 movements indicate lexical
contrast, speakers cannot accent or deaccent words according to their infor-
mation status. Focus is manifested via pitch range manipulation (Garding
1987; Shih 1988; Xu 1999). Specifically, “the pitch range of the focused region
is expanded; that of the post-​focus region compressed (i.e., PFC); and that of
the pre-​focused region left largely neutral” (Xu 2005: 235).
Although there are a number of studies showing the pitch range com-
pression of the post-​focus constituents (e.g., Shih 1988; Xu 1999), Chen
(2010) reported that post-​focus lexical tones can also be realized with a more
expanded f0 range than their pre-​focus counterparts. She showed that the spe-
cific f0 contours of post-​focus lexical tones are subject to the influence of
(mainly) preceding as well as (sometimes) following tonal contexts. Despite
the sometimes expanded f0 range, post-​focus lexical tones are consistently
realized with significantly less distinctive f0 contours than their on-​focus
counterparts. Therefore, Chen proposed that

f0 range compression is not the only and primary characteristic of post-focus
tonal realization. Rather, the observed post-focus effects may be
viewed as the multifaceted manifestations of weak implementation of
post-​focus tonal targets, as they are associated with prosodically non-​
prominent constituents.
(2010: 517)

Last but not least, there has been a long debate on the relationship between
focus encoding and prosodic structure. Different analyses have been proposed
and can be roughly divided into two sub-​groups: (1) Indirect Encoding: The
prosodic encoding of focus is mediated by prosodic structure. In particular,
focus relocates prosodic prominence and in turn triggers inserting or deleting
prosodic boundaries, and the phonetic effects are the results of this modi-
fication (Pierrehumbert 1980; Gussenhoven 1983; Truckenbrodt 1995; Ladd
1996). In other words, the focal f0-​rise is a result of the prosodic boundary


insertion to the left of the focus, while the post-​focal reduction is a result of
the boundary deletion. (2) Direct Encoding: Focus is assigned to morpho-
syntactic elements and is then encoded directly into prosody without any ref-
erence to the prosodic structure (Eady and Cooper 1986; Xu and Xu 2005;
Ishihara 2011). On this view, focal f0-rise and post-focal reduction result from
direct manipulation of pitch register.

7.1.2 Shanghai Chinese

Shanghai Chinese (SHC) is a Northern Wu dialect, spoken in metropolitan
Shanghai. It has five lexical tones, which are represented by a set of features: (1)
Contour: falling (T1) and rising (T2–​T5); (2) Pitch register: high (T1, T2, and
T4) and low (T3 and T5 with breathy phonation); (3) Duration: long (T1–​
T3) and short (T4 and T5 with glottal coda). Furthermore, there are three
features of its tonal system that differentiate Shanghai Chinese from Standard
Chinese. First, SHC has maintained a contrast between voiceless and voiced
consonants, which is represented as a distinction between modal and breathy
phonation nowadays, as well as a co-​occurrence restriction between voi-
cing/​phonation and f0: The tones in a high pitch register (T1[53], T2[34],
and T4[55]) only occur after voiceless consonants (modal phonation), while
the tones in low pitch register (T3[13] and T5[12]) only occur after voiced
consonants (breathy phonation) (Cao and Maddieson 1992; Chen 2014;
Chen and Gussenhoven 2015). Second, SHC has retained the checked syllable
ending with a glottal coda [CVʔ], which has considerably shorter duration
than open or sonorant-​closed syllables [CV(N)]. There are two tones (T4[55]
and T5[12]) linked to checked syllables and three tones (T1[53], T2[34],
T3[13]) linked to open or sonorant-​closed syllables (Zhang and Meng 2016).
Last but not least, the most distinctive feature of SHC is that when syllables
are organized into groups, lexical tones undergo sandhi changes, which are
sensitive to the morphosyntactic structure: Compound words (e.g., modifier–​
nouns) undergo left-​dominant sandhi (LDS), which spreads the tone of the
initial syllable across the entire word; while phrases (e.g., non-​lexicalized
verb–​nouns) undergo right-​dominant sandhi (RDS), which retains the tone
of the final syllable and levels the contour of the non-​final tone (Xu and Tang
1988) (see Table 7.1).
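The feature specification and the co-occurrence restrictions described above can be restated as a small lookup structure. The Python sketch below is illustrative only: the class and function names are hypothetical, onsets are simplified to a binary voiced/voiceless distinction (how sonorant onsets pattern is not addressed here), and T4[55] is listed as rising only because the text groups T2–T5 together under that contour.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tone:
    name: str
    chao: str       # citation value in Chao's five-level notation
    contour: str    # "falling" or "rising", following the text's grouping
    register: str   # "high" or "low"
    checked: bool   # True = short tone on a checked syllable [CVʔ]

TONES = {
    "T1": Tone("T1", "53", "falling", "high", False),
    "T2": Tone("T2", "34", "rising", "high", False),
    "T3": Tone("T3", "13", "rising", "low", False),
    "T4": Tone("T4", "55", "rising", "high", True),
    "T5": Tone("T5", "12", "rising", "low", True),
}

def is_licit(tone_name, onset_voiced, syllable_checked):
    """Check the two restrictions: register vs. onset voicing, length vs. coda."""
    tone = TONES[tone_name]
    register_ok = (tone.register == "low") == onset_voiced
    syllable_ok = tone.checked == syllable_checked
    return register_ok and syllable_ok

def candidate_tones(onset_voiced, syllable_checked):
    """List the tones compatible with a given onset and syllable type."""
    return [name for name in TONES
            if is_licit(name, onset_voiced, syllable_checked)]

print(candidate_tones(onset_voiced=False, syllable_checked=False))
print(candidate_tones(onset_voiced=True, syllable_checked=True))
```

Under these restrictions the only genuinely free tonal contrast is T1 versus T2, on unchecked syllables with voiceless onsets; every other onset and syllable type determines the tone uniquely.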
Furthermore, there have been several phonological analyses on the tone
sandhi in SHC (Zee and Maddieson 1980; Selkirk and Shen 1990; Chen
2000; Yip 2002). Most linguists agreed that LDS is a phonological tonal
rightward spread, because the base tone contrast of σ2 is neutralized and
its f0 contour is determined by the base tone of σ1 (Zee and Maddieson
1980; Chen 2000; Yip 2002). However, due to the lack of empirical study on
RDS, the nature of RDS is still in dispute. Duanmu (2005) regards it as a
phonological leveling with neutralized level targets, while Takahashi (2013)
and Zhang and Meng (2016) propose that it is better interpreted as a phon-
etic contour reduction. In addition, Selkirk and Shen (1990) first proposed
that the application of LDS/​RDS is dependent on the prosodic structure:


Table 7.1 The value of citation tones and sandhi tones in SHC (using Chao’s
five-​level numerical scale, which divides a speaker’s pitch range into
five scales, with 5 indicating the highest and 1 the lowest)

Register  Tone group  LDS    RDS
High      T1[53]+X    55+31
          T4[55]+X    33+44  44+X
Low       T3[13]+X    22+44  33+X
          T5[12]+X    11+13  22+X

A prosodic word, comprising one prosodic domain, is supposed to undergo
LDS, while a prosodic phrase, comprising two prosodic domains, is
supposed to undergo RDS. Take “炒饭 [tshɔ34 vε13]”, for example. When
it means [fried rice] as an NP compound, it forms one prosodic word (with
one domain) and undergoes LDS [tshɔ33 vε44]; when it means [to fry rice] as
a VP phrase, it forms one prosodic phrase (with two domains) and undergoes
RDS [tshɔ44 vε13], as illustrated below.

                      Left-dominant sandhi     Right-dominant sandhi

Morphosyntax:         (Adj. N) NP              (V N) VP
                      炒 饭 (fried rice)        炒 饭 (to fry rice)
                      /tshɔ33 vε44/            /tshɔ44 vε13/
Prosodic headedness:  *                        *
Prosodic structure:   (Adj. N)ω                ((V)ω (N)ω)φ
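The dependence of sandhi type on prosodic structure can be sketched as a lookup keyed by the number of prosodic domains. This Python sketch is illustrative only: the function name is hypothetical, the T3 and T5 values are taken from Table 7.1, the T2 values are taken from the 炒饭 example, and the other tone groups are left out.

```python
# Citation values of the tones used in this sketch (Chao's notation).
CITATION = {"T2": "34", "T3": "13", "T5": "12"}

# Left-dominant sandhi: the pattern of the whole word depends on the first tone.
LDS = {"T2": ("33", "44"), "T3": ("22", "44"), "T5": ("11", "13")}

# Right-dominant sandhi: the first tone is leveled, the final tone is retained.
RDS_FIRST = {"T2": "44", "T3": "33", "T5": "22"}

def apply_sandhi(tone1, tone2, n_domains):
    """Return the surface values of a disyllable given its prosodic structure."""
    if n_domains == 1:    # one prosodic word, e.g. a compound
        return LDS[tone1]
    if n_domains == 2:    # a prosodic phrase made of two prosodic words
        return (RDS_FIRST[tone1], CITATION[tone2])
    raise ValueError("only one or two prosodic domains are handled")

# 炒饭: T2 + T3
print(apply_sandhi("T2", "T3", n_domains=1))  # compound reading, 'fried rice'
print(apply_sandhi("T2", "T3", n_domains=2))  # phrasal reading, 'to fry rice'
```

Note that the second tone plays no role in the one-domain case, which mirrors the neutralization of the base tone of the second syllable under LDS.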

These unique tonal features make SHC an interesting case for the study of
the prosodic encoding of contrastive focus and the relationship between focus
encoding and prosodic structure. In this chapter, we attempt to answer the
following questions:

1 What are the f0, duration, and intensity patterns of compound words and
VP phrases? Do they reflect the same or different prosodic structures?
2 How does contrastive focus affect the f0, duration, and intensity patterns
of compound words and VP phrases?
3 What is the relationship between focus encoding and the prosodic struc-
ture in SHC, direct or indirect?

7.2 Method

7.2.1 Test materials

The target syllables in our materials are indicated as S1, S2, S3, and S4
(illustrated in Table 7.2). S1 and S2 form a modifier-​noun compound (an


Table 7.2 Stimulus sentences

Sentence  S1         S2          S3        S4        Meaning
Type 1    老/lɔ3/    周/tsɤ1/    买/ma3/   糕/gɔ1/   (Old Zhou buys a cake.)
          old        a surname   to buy    cake
Type 2    老/lɔ3/    周/tsɤ1/    烧/sɔ1/   饭/ve3/   (Old Zhou cooks rice.)
          old        a surname   to cook   rice
Type 3    周/tsɤ1/   老/lɔ3/     买/ma3/   糕/gɔ1/   (Respected old Zhou buys a cake.)
          a surname  old people  to buy    cake
Type 4    周/tsɤ1/   老/lɔ3/     烧/sɔ1/   饭/ve3/   (Respected old Zhou cooks rice.)
          a surname  old people  to cook   rice

Note: The numbers in the upper right corner indicate the tone type.

appellation “Old Zhou” or “Respected old Zhou”), which is expected to
undergo left-dominant sandhi, while S3 and S4 form a verb-noun phrase (“to
buy cake” or “to cook rice”), which is expected to undergo right-dominant
sandhi, according to the literature (Xu and Tang 1988). Furthermore, each
syllable position carries one of two tones, T1[HL] or T3[LH], and only
two tonal combinations, T1[HL]+T3[LH] and T3[LH]+T1[HL], are selected
for each compound word or VP phrase. In total, there are four stimulus
sentences (see Table 7.2). In order to avoid the influence of the boundary
intonation on the duration, f0, and intensity of the syllable at the onset and
offset of the sentence, these target syllables were embedded in a carrier sen-
tence “[ŋu3 sia2] …… [gəʔ5 tɕy2] [ɦɛ3ɦo3]” (I wrote “……” this sentence).
Furthermore, the stimulus sentences were elicited in two discourse contexts
with a corresponding Yes/No-question: (1) Non-focused condition: The
target syllables were old information in the discourse (hereafter referred to
as [NF]); (2) Focused condition: S1, S2, S3, or S4 was the corrective informa-
tion in the discourse, that is, the contrastive focus was located on one of the
four target syllables (hereafter referred to as [F-​S1], [F-​S2], [F-​S3], or [F-​S4]
respectively). In total, we have five different focus conditions: non-​focused
[NF], focus located on S1[F-​S1], S2[F-​S2], S3[F-​S3], or S4[F-​S4]. The non-​
focused condition served as the baseline for the other focused conditions. An
example is illustrated in Table 7.3.

7.2.2 Subjects and recording procedures

Six male and eight female speakers, aged between 25 and 35, participated
in the study. The speakers and both of their parents were born and raised in Shanghai
urban areas. They were paid for their participation and none reported any
hearing, vision, or reading deficiencies. The recording was conducted in the
sound booth at Tongji University. To obtain reliable measures of intensity,
participants wore a Sennheiser PC 300 high-​quality headset microphone that
was positioned a constant distance (about 5 cm) from the speaker’s mouth.


Table 7.3 An example of discourse contexts

Stimulus sentence:
[ŋu3 sia2] “[lɔ3 tsɤ1] [ma3] [gɔ1]” [gəʔ5 tɕy2] [ɦɛ3ɦo3].
(I wrote “Old Zhou buys a cake” this sentence.)

Leading questions:
a) [noŋ3] [sia2] “[lɔ3 tsɤ1] [ma3] [gɔ1]” [va5]?
   Did you write “Old Zhou buys a cake”?
b) [noŋ3] [sia2] “[siɔ2 tsɤ1] [ma3] [gɔ1]” [va5]?
   Did you write “Little Zhou buys a cake”?
c) [noŋ3] [sia2] “[lɔ3 tsaŋ1] [ma3] [gɔ1]” [va5]?
   Did you write “Old Zhang buys a cake”?
d) [noŋ3] [sia2] “[lɔ3 tsɤ1] [tsəŋ1] [gɔ1]” [va5]?
   Did you write “Old Zhou steams a cake”?
e) [noŋ3] [sia2] “[lɔ3 tsɤ1] [ma3] [ve3]” [va5]?
   Did you write “Old Zhou buys rice”?

Target sentences (focus condition):
a) [ŋu3 sia2] “[lɔ3 tsɤ1] [ma3] [gɔ1]” [gəʔ5 tɕy2] [ɦɛ3ɦo3] [N-F]
b) [ŋu3 sia2] “[lɔ3 tsɤ1] [ma3] [gɔ1]” [gəʔ5 tɕy2] [ɦɛ3ɦo3] [F-S1]
c) [ŋu3 sia2] “[lɔ3 tsɤ1] [ma3] [gɔ1]” [gəʔ5 tɕy2] [ɦɛ3ɦo3] [F-S2]
d) [ŋu3 sia2] “[lɔ3 tsɤ1] [ma3] [gɔ1]” [gəʔ5 tɕy2] [ɦɛ3ɦo3] [F-S3]
e) [ŋu3 sia2] “[lɔ3 tsɤ1] [ma3] [gɔ1]” [gəʔ5 tɕy2] [ɦɛ3ɦo3] [F-S4]

The leading questions were recorded beforehand by a male speaker (not
included among the participants). For each trial, an appropriate leading
question was played through headphones to the participant and then the
participant read aloud the target sentence. The sentences were presented in
PowerPoint slides in random order. Every participant read the material three
times. Therefore, we obtained 4 stimulus sentences × 5 focus conditions × 14
speakers × 3 repetitions = 840 sentences.
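The size of the recording design can be checked by enumerating it. The following is only a minimal sketch: the speaker IDs and the dictionary keys are hypothetical labels introduced for illustration, not names used by the authors.

```python
from itertools import product

# Factor levels taken from the text; the speaker IDs below are hypothetical.
SENTENCE_TYPES = ["Type 1", "Type 2", "Type 3", "Type 4"]
FOCUS_CONDITIONS = ["N-F", "F-S1", "F-S2", "F-S3", "F-S4"]
N_SPEAKERS = 14
N_REPETITIONS = 3


def build_design():
    """Return one record per recorded sentence token (fully crossed design)."""
    speakers = [f"spk{i:02d}" for i in range(1, N_SPEAKERS + 1)]
    repetitions = range(1, N_REPETITIONS + 1)
    return [
        {"sentence": s, "focus": f, "speaker": spk, "repetition": rep}
        for s, f, spk, rep in product(
            SENTENCE_TYPES, FOCUS_CONDITIONS, speakers, repetitions
        )
    ]


design = build_design()
print(len(design))  # 4 * 5 * 14 * 3 = 840
```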

7.2.3 Data analysis

The acoustic analysis was done in Praat (Boersma and Weenink 2010). The
onset and offset of vowels were manually labeled by referring to the change of
F2 in the spectrogram. Then the Praat script “ProsodyPro” (Xu 2013) was run,
and it automatically provided us with (1) accurate f0 tracks using a method
that combines automatic vocal pulse marking by Praat, a trimming algorithm
that removes spikes and sharp edges, and a triangular smoothing function
(cf. Appendix 1 in Xu 1999); and (2) maxf0 (Hz), minf0 (Hz), mean intensity
(dB), and duration (ms) from each labeled interval.
Subsequently, the f0 measurements in Hz were converted to semitones
relative to 50 Hz using the formula in (1) to better reflect pitch perception
(Rietveld and Chen 2006). Formula (1) relates frequency in semitones, F, to
frequency in Hz, f:

F = 12 × log2(f / 50)    (1)
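Formula (1) can be written as a small function. This is an illustrative sketch only; the function name and the `ref_hz` parameter are assumptions made here, with the 50 Hz reference taken from the text.

```python
import math


def hz_to_semitones(f_hz, ref_hz=50.0):
    """Convert a frequency in Hz to semitones relative to ref_hz, as in formula (1)."""
    if f_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f_hz / ref_hz)


# One octave above the reference (100 Hz) corresponds to 12 semitones.
print(hz_to_semitones(100.0))
```

Because the scale is logarithmic, doubling the frequency always adds 12 semitones, whatever the starting pitch.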

In order to eliminate the substantial individual differences in mean intensity
and duration, relative intensity and relative duration were calculated by
dividing each absolute value of intensity and duration by the corresponding
average value for that speaker.
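The speaker normalization described above can be sketched as follows. The text does not state which tokens enter each speaker's average; this sketch assumes the mean is taken over all of that speaker's tokens, and the record keys (`speaker`, `duration`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def relative_values(records, key):
    """Divide each value by the mean of that measure for the same speaker.

    Assumption: the speaker mean is computed over all of the speaker's tokens.
    """
    by_speaker = defaultdict(list)
    for rec in records:
        by_speaker[rec["speaker"]].append(rec[key])
    speaker_mean = {spk: mean(vals) for spk, vals in by_speaker.items()}
    return [rec[key] / speaker_mean[rec["speaker"]] for rec in records]


# Toy data (invented for illustration): durations in ms for two speakers.
tokens = [
    {"speaker": "A", "duration": 100.0},
    {"speaker": "A", "duration": 200.0},
    {"speaker": "B", "duration": 50.0},
    {"speaker": "B", "duration": 150.0},
]
print(relative_values(tokens, "duration"))
```

A relative value above 1 means the token is longer (or louder) than that speaker's own average, which makes speakers with different habitual rates and levels comparable.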
Linear mixed-effects models were used to investigate how f0, duration, and
intensity were affected by focus condition and prosodic structure. All
statistical analyses were carried out in R version 3.3.2 (R Core Team 2016)
using the lme4 package version 1.1-12 (Bates et al. 2014).

7.3 Results
Figure 7.1 displays mean f0 contours of the four target syllables in the four
stimulus sentences, uttered in the non-​focused condition. These f0 contours
were obtained by taking 10 f0 points (in Hz) at proportionally equal time
intervals between the acoustic onset and offset of the vowel in the target
syllables, and then these values were transformed into semitones and averaged
across speakers and repetitions.
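The time normalization described above can be sketched as below. This is not ProsodyPro's implementation: it assumes linear interpolation, assumes that the ten points include the vowel onset and offset, and uses hypothetical function names.

```python
import math


def sample_contour(times, f0_hz, n_points=10):
    """Sample an f0 track at n_points equally spaced times (linear interpolation).

    Assumption: the first and last samples mark the vowel onset and offset.
    """
    if len(times) != len(f0_hz) or len(times) < 2 or n_points < 2:
        raise ValueError("need matching tracks with at least two samples")
    if any(t1 <= t0 for t0, t1 in zip(times, times[1:])):
        raise ValueError("times must be strictly increasing")
    step = (times[-1] - times[0]) / (n_points - 1)
    sampled, j = [], 0
    for i in range(n_points):
        t = times[0] + i * step
        # Advance to the segment [times[j], times[j + 1]] that contains t.
        while j < len(times) - 2 and times[j + 1] < t:
            j += 1
        w = (t - times[j]) / (times[j + 1] - times[j])
        w = min(max(w, 0.0), 1.0)  # guard against floating-point overshoot
        sampled.append(f0_hz[j] + w * (f0_hz[j + 1] - f0_hz[j]))
    return sampled


def to_semitones(contour_hz, ref_hz=50.0):
    """Convert each sampled point to semitones relative to ref_hz."""
    return [12.0 * math.log2(f / ref_hz) for f in contour_hz]


def mean_contour(contours):
    """Average several equal-length contours point by point."""
    return [sum(points) / len(points) for points in zip(*contours)]
```

Each token would be sampled in Hz, converted with `to_semitones`, and the resulting contours averaged with `mean_contour` across speakers and repetitions.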

[Figure 7.1: four panels (S1, S2, S3, S4); x-axis: normalized time (points 1–10); y-axis: f0 (st); one contour per sentence type (Types 1–4).]

Figure 7.1 The time-normalized f0 contours of the four target syllables within the four
sentence types, uttered in the non-focused condition. S1, S2, S3, and S4 stand
for the first, second, third, and fourth syllable within the sentence. Sentence
Type 1 is “/lɔ3tsɤ1 ma3gɔ1/”; Type 2 is “/lɔ3tsɤ1 sɔ1ve3/”; Type 3 is “/tsɤ1lɔ3
ma3gɔ1/”; and Type 4 is “/tsɤ1lɔ3 sɔ1ve3/”


According to the literature (Xu and Tang 1988), S1 and S2 form a com-
pound noun undertaking left-​dominant sandhi, while S3 and S4 form a VP
phrase undertaking right-​dominant sandhi. Although they have the same
tonal combinations T1[HL]+T3[LH] and T3[LH]+T1[HL], the f0 patterns of
S1+S2 and S3+S4 are completely different.
The f0 contours of S3 and S4 are quite consistent. In both S3 and S4
positions, the f0 contours of T1[HL] (烧/​sɔ1/​ and 糕/​gɔ1/​) are realized as
high-​falling tone, and those of T3[LH] (买/​ma3/​ and 饭/​ve3/​) are realized as
low-​rising tone, which are consistent with their citation tones respectively.
This result differs from that in the literature: A VP phrase should
undertake right-​dominant tone sandhi, which retains the tone of the final
syllable and levels the contour of the non-​final tone (Xu and Tang 1988).
Specifically, T1[HL] and T3[LH] in S3 position should be realized as high-​
level and low-​level tones respectively. However, in our study, they seem to
retain their base tone contours: high-​falling and low-​rising respectively. This
might be related to the congruous tonal combinations (T1[HL]+T3[LH] and
T3[LH]+T1[HL]) we used in our study, which inhibited tonal leveling. If the
tonal combinations are changed into T1[HL]+T1[HL] or T3[LH]+T3[LH],
in which two syllables have contradictory tonal targets, they might be realized
as high-​level and low-​level tones as recorded in the literature. However, even
if our assumption holds, the right-​dominant sandhi is more likely caused by
tonal coarticulation and is better interpreted as a phonetic contour reduction
as proposed by Takahashi (2013) and Zhang and Meng (2016), rather than
a phonological tone leveling process as proposed by Duanmu (2005). All in
all, further investigations on the f0 patterns of the right-​dominant sandhi are
needed in order to have a better understanding of the nature of tone sandhi
in Shanghai Chinese.
The f0 contours of S1 and S2 are quite different from those of S3 and S4,
although they have the same tonal combinations. In S1 position, T1[HL]
(周/tsɤ1/) and T3[LH] (老/lɔ3/) are realized as high-level and slightly falling
low tones respectively, instead of the high-falling and low-rising tones found
in S3 position. The main difference lies in S2 position, where T1[HL] (周/tsɤ1/)
and T3[LH] (老/lɔ3/) are realized as high-level and high-falling tones respectively,
which are completely different from their citation tones. These results are in line with the
literature (Xu and Tang 1988), which finds that a compound word undertakes
the left-​dominant tone sandhi, which spreads the tone contour of the ini-
tial syllable across the entire word. In other words, the base tone contrast
of S2 is neutralized, and the f0 contour is determined by the base tone of
S1. In our study, there are only two tonal combinations T1[HL]+T3[LH] and
T3[LH]+T1[HL]. Therefore, in the T1+T3 group, T3[LH] in S2 position loses
its base tone contour, accepts the tonal target [L] assigned by T1[HL] from
S1 position, and is realized as a high-falling tone. Similarly, in the T3+T1
group, T1[HL] in S2 position accepts the tonal target [H] assigned by T3[LH]
from S1 position, and is realized as a high-level tone. Furthermore, there is
an obvious anticipatory effect at the offset f0 of the second syllable, as the
offset f0 of S2 in Type 2 and Type 4 (followed by T1[HL]) is clearly higher
than that of its counterparts in Type 1 and Type 3 (followed by T3[LH]).

Table 7.4 The description of the tonal realization of S1+S2 compound and
S3+S4 phrase

Tonal combination    Compound (S1+S2)    VP phrase (S3+S4)
Tone sandhi          Left-dominant       Right-dominant

In summary, in our study S1 and S2 form a modifier-​noun compound, and
their f0 patterns are in line with the finding in the literature (Xu and Tang
1988) that a compound undertakes left-​dominant tone sandhi, and that the
base tone contrast of σ2 is neutralized and the f0 pattern of the whole com-
pound is determined by the base tone of σ1. However, there is some diver-
gence from the literature with regard to the f0 patterns of S3+S4. According
to Xu and Tang (1988), a VP phrase should undertake right-​dominant sandhi,
which retains the tone of the final syllable and levels the contour of the non-​
final tone. In our study, S3 and S4 form a VP phrase, but both syllables retain
their base tone contours. Is this phenomenon caused by the congruous tonal
combination, or is it realized in this way all the time? Further investigation
is required. All in all, we use [H] and [L] to describe the tonal realization of
S1+S2 and S3+S4 groups in Table 7.4. [H] and [L] here only represent the
phonetic realization rather than the phonological analysis of the tonal targets.

7.3.1 Analysis of f0 measurements

Figure 7.2 illustrates the effects of contrastive focus on the f0 realization of
each target syllable in stimulus sentences. We can see that, in general, con-
trastive focus is realized by raising the maxf0 of the focused syllable while
having little influence on the minf0. Furthermore, in F-​S1 and F-​S2 conditions,
there is an obvious post-​focus compression (PFC) phenomenon on S3 and S4,
as the f0 contour of both S3 and S4 is significantly lowered and compressed.
Last but not least, the f0 adjustments of S1 and S2 are the same in the F-S1 and
F-​S2 conditions, regardless of the different focus position. In other words,
different focus positions are not reflected through the f0 implementation of
S1+S2 compound.
In order to confirm our observation and to investigate the effect of contrastive
focus on the f0 realization of each syllable, the data were split according
to syllable position and tone type, because contrastive focus induced
different f0 adjustments of T1[HL] and T3[LH]. Then linear mixed-effects
models were fitted to the minf0 and maxf0 of each syllable respectively,
with condition (N-F, F-S1, F-S2, F-S3, versus F-S4) as fixed factor, and with
speaker and repetition as random factors. The results of the effects of contrastive
focus on the f0 realization of each syllable are summarized in Table 7.5.
Results showed that the F-S1 and F-S2 conditions induced the same f0
adjustments of each syllable. That is, the f0 contours of S1 and S2 were
enhanced, while those of S3 and S4 were reduced, which confirmed our
above observation. Specifically, because T1[HL] is realized as [HH] in S1
and S2 positions as elaborated above, contrastive focus significantly raised
the maxf0 and minf0 of T1 in both S1 and S2 positions. Because T3[LH]
is realized as [LL] and [HL] in S1 and S2 positions respectively, the adjustment
induced by contrastive focus was different in the two positions. In S1
position, the minf0 of T3[LL] was significantly decreased, while there was
no significant change of its maxf0; in S2 position, the maxf0 of T3[HL] was
significantly increased, while the minf0 was significantly decreased. In other
words, contrastive focus made the tonal contour more distinctive by making
the [H] tone higher and the [L] tone lower. Furthermore, S1 and S2 had the same
f0 adjustment patterns regardless of the focus position (on S1 or S2), in order to retain
the f0 pattern of the compound. In addition, the maxf0 and minf0 of both
S3 and S4 were significantly decreased in both F-S1 and F-S2 conditions,
reflecting a clear post-focus compression (PFC) phenomenon. It should be
noted that, when contrastive focus was located on S1 (in F-S1 condition),
there was no PFC phenomenon on S2; rather, its f0 contour was enhanced as
elaborated above. Such a result indicated that post-focus compression is sensitive
to prosodic boundaries in Shanghai Chinese, as it only happens after a
prosodic boundary, while it does not happen within a prosodic word (S1+S2).
In F-​S3 condition, there was no significant f0 adjustment of S1 and S2.
As for the f0 adjustment of S3, contrastive focus significantly increased the
maxf0 of T1[HL] and T3[LH], while it had little effect on their minf0. As for
the f0 adjustment of S4, contrastive focus significantly reduced the maxf0
and minf0 of T3[LH] in S4 position, which reflected a PFC phenomenon.
However, it significantly increased the maxf0 of T1[HL] but had no effect on
its minf0, which could be attributed to the carryover effect from the preceding
syllable. Because the offset f0 of the preceding syllable (T3[LH]) was signifi-
cantly raised by the contrastive focus, and the onset f0 (i.e., maxf0) of T1[HL]
was dependent on the offset f0 of its preceding syllable, the maxf0 (i.e., onset
f0) of T1[HL] in S4 position was therefore increased.
In F-S4 condition, regarding the focused syllable (S4), the maxf0 of both
T1[HL] and T3[LH] was significantly increased, while there was no significant
change of their minf0. As for the pre-focus syllables, there was no significant
f0 adjustment of S1, S2, and S3, except that the maxf0 of T3[LH] in S3
position was significantly increased. Such a phenomenon can be explained as
an anticipatory effect. Because the onset f0 of the following syllable (T1[HL])
would be significantly raised by contrastive focus, the offset f0 (i.e., maxf0) of
T3[LH] in S3 position was raised in anticipation.

[Figure 7.2: four panels (S1, S2, S3, S4); x-axis: normalized time (points 1–10); y-axis: f0 (st); one contour per focus condition (N-F, F-S1, F-S2, F-S3, F-S4).]

Figure 7.2 The time-normalized f0 contours of the four stimulus sentences (Types 1–4), uttered in the
non-focused condition (N-F: red) and in the focused conditions with contrastive focus on S1
(F-S1: dark green), on S2 (F-S2: green), on S3 (F-S3: blue), and on S4 (F-S4: purple)


In summary, contrastive focus was found to significantly influence the
global f0 contour of the whole sentence, which was generally in line with the
results of Xu (1999), who studied the tonal realization and focus encoding in
Standard Chinese. In our study, a contrastive focus substantially enhanced
the f0 realization of the focused syllable, by (mainly) raising its maxf0 and
(sometimes) lowering its minf0; it lowered and compressed the f0 contour of
the post-​focus syllables in general, and had little effect on the f0 realization of
the pre-​focus syllables.
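The phrase “lowered and compressed” can be made concrete with the estimates in Table 7.5. The sketch below assumes that each estimate is the change in semitones relative to the non-focused baseline, so that the change in f0 range is the maxf0 estimate minus the minf0 estimate; the function name and dictionary layout are illustrative, not the authors'.

```python
def range_change(delta_maxf0, delta_minf0):
    """Change in f0 range implied by the changes in maxf0 and minf0.

    Assumption: both inputs are differences from the N-F baseline in semitones.
    """
    return delta_maxf0 - delta_minf0


# Estimates for T1 in the F-S1 condition, copied from Table 7.5:
# (maxf0 estimate, minf0 estimate) per syllable position.
f_s1_t1 = {
    "S1": (5.083, 4.660),
    "S3": (-4.577, -3.518),
    "S4": (-7.458, -4.350),
}

for position, (d_max, d_min) in f_s1_t1.items():
    print(position, round(range_change(d_max, d_min), 3))
```

Under this reading, the focused compound syllable S1 is shifted upward with a slightly wider range, whereas the post-focus syllables S3 and S4 are shifted downward with a narrower range.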
Furthermore, there are three things to be noted. First, the f0 of both
syllables within the compound word (S1+S2) are enhanced together, regard-
less of the contrastive focus position (either on S1 or S2), while in the VP
phrase, only the f0 of the focused syllable is enhanced by contrastive focus (in
F-​S3 or F-​S4 conditions). In other words, the domain of f0 enhancement in a
compound word is the whole compound, while that in the VP phrase is only
the focused syllable, which verifies that there is a prosodic boundary between
the two syllables of the VP phrase, while there is no boundary within the com-
pound word. That is, a VP phrase forms a prosodic phrase, whereas a com-
pound word forms a prosodic word (Selkirk and Shen 1990). Such results also
indicate that the contrastive focus is encoded through the prosodic structure
in Shanghai Chinese, that is, indirect encoding.
Second, post-focus compression happens between S2 and S3, which are separated
by a prosodic phrase boundary, and between S3 and S4, which are separated by
a prosodic word boundary. However, it does not happen between
S1 and S2, that is, within a prosodic word. This result indicates that post-focus
compression might also be sensitive to prosodic structure in Shanghai
Chinese, as it only happens after a prosodic boundary (at least a prosodic
word boundary).
Third, PFC is not the only characteristic of post-​focus tonal realization,
as the maxf0 of T1[HL] in S4 position (post-​focus syllable) was significantly
increased in F-​S3 condition. This might be caused by the carryover effect, as
the offset f0 of the preceding syllable (S3 with T3[LH]) was significantly raised
by focus. This result is in line with the results in Chen (2010), who studied the
prosodic encoding of contrastive focus in Standard Chinese. She found that
the specific f0 contours of post-​focus lexical tones are subject to the influ-
ence of (mainly) preceding as well as (sometimes) following tonal contexts.
Therefore, post-​focus lexical tones are sometimes realized with a compressed
f0 range, but they can also be realized with a more expanded f0 range than
their pre-​focus counterparts.

Table 7.5 The effects of contrastive focus on the maxf0 and minf0 of each syllable

                   | S1 (T1[HH]/T3[LL])           | S2 (T1[HH]/T3[HL])           | S3 (T1[HL]/T3[LH])           | S4 (T1[HL]/T3[LH])
Focus Tone Measure | Estimate Std.E       t     p | Estimate Std.E       t     p | Estimate Std.E       t     p | Estimate Std.E       t     p
F-S1  T1   maxf0   | ↑  5.083 0.360  14.102 0.000 | ↑  5.083 0.360  14.102 0.000 | ↓ -4.577 0.521  -8.790 0.000 | ↓ -7.458 0.460 -16.222 0.000
F-S1  T1   minf0   | ↑  4.660 0.353  13.188 0.000 | ↑  4.660 0.353  13.188 0.000 | ↓ -3.518 0.392  -8.975 0.000 | ↓ -4.350 0.502  -8.662 0.000
F-S1  T3   maxf0   |    0.149 0.222   0.673 0.501 | ↑  3.328 0.416   8.002 0.000 | ↓ -1.287 0.378  -3.401 0.001 | ↓ -4.142 0.440  -9.418 0.000
F-S1  T3   minf0   | ↓ -1.508 0.337  -4.478 0.000 | ↓ -3.310 0.367  -9.021 0.000 | ↓ -2.060 0.374  -5.506 0.000 | ↓ -2.365 0.460  -5.140 0.000
F-S2  T1   maxf0   | ↑  2.879 0.360   7.986 0.000 | ↑  2.879 0.360   7.986 0.000 | ↓ -4.555 0.521  -8.749 0.000 | ↓ -6.799 0.460 -14.787 0.000
F-S2  T1   minf0   | ↑  2.562 0.353   7.250 0.000 | ↑  2.562 0.353   7.250 0.000 | ↓ -3.952 0.392 -10.080 0.000 | ↓ -4.018 0.502  -8.001 0.000
F-S2  T3   maxf0   |   -0.185 0.222  -0.834 0.404 | ↑  2.156 0.416   5.183 0.000 | ↓ -1.233 0.378  -3.259 0.001 | ↓ -4.113 0.440  -9.354 0.000
F-S2  T3   minf0   | ↓ -0.992 0.337  -2.946 0.003 | ↓ -4.622 0.367 -12.596 0.000 | ↓ -1.963 0.374  -5.247 0.000 | ↓ -2.394 0.460  -5.203 0.000
F-S3  T1   maxf0   |   -0.011 0.360  -0.031 0.975 |   -0.011 0.360  -0.031 0.975 | ↑  4.588 0.521   8.811 0.000 | ↑  2.037 0.460   4.430 0.000
F-S3  T1   minf0   |    0.096 0.353   0.271 0.787 |    0.096 0.353   0.271 0.787 |   -0.096 0.392  -0.246 0.806 |   -0.177 0.502  -0.352 0.725
F-S3  T3   maxf0   |    0.134 0.222   0.603 0.546 |    0.283 0.416   0.680 0.496 | ↑  2.251 0.378   5.950 0.000 | ↓ -2.484 0.440  -5.648 0.000
F-S3  T3   minf0   |   -0.204 0.337  -0.606 0.544 |    0.283 0.367   0.771 0.441 |   -0.483 0.374  -1.291 0.197 | ↓ -1.483 0.460  -3.224 0.001
F-S4  T1   maxf0   |   -0.245 0.360  -0.679 0.497 |   -0.245 0.360  -0.679 0.497 |    0.756 0.521   1.451 0.147 | ↑  4.063 0.460   8.837 0.000
F-S4  T1   minf0   |    0.068 0.353   0.191 0.848 |    0.068 0.353   0.191 0.848 |    0.199 0.392   0.506 0.613 |    0.560 0.502   1.116 0.265
F-S4  T3   maxf0   |    0.213 0.222   0.960 0.337 |    0.254 0.416   0.611 0.541 | ↑  1.114 0.378   2.944 0.003 | ↑  4.423 0.440  10.059 0.000
F-S4  T3   minf0   |    0.356 0.337   1.056 0.291 |    0.389 0.367   1.059 0.290 |    1.121 0.374   2.996 0.003 |    0.503 0.460   1.094 0.274

7.3.2 Analysis of duration and intensity

The statistical analysis of duration and intensity measurements was
conducted in two steps. First, in order to have a general understanding of
the duration and intensity distribution among the four syllable positions, the
data of the non-focused condition was selected and two sets of linear mixed-effects
models were constructed on the data of rhyme duration and mean
intensity respectively, with syllable position (S1, S2, and S3 versus S4), tone
type (T1[HL] versus T3[LH]) and their interactions as fixed factors, and
with speaker and repetition as random factors. Second, in order to investigate
the effects of contrastive focus on the duration and intensity patterns, the data
was split according to syllable position first, and then a series of linear mixed-effects
models were constructed on the data of each syllable respectively, with
focus condition (N-F, F-S1, F-S2, F-S3 versus F-S4) as fixed factor, and with
speaker and repetition as random factors.

Table 7.6 The effects of contrastive focus on the rhyme duration and mean intensity of each syllable

                   | S1                           | S2                           | S3                           | S4
Measure    Focus   | Estimate Std.E       t     p | Estimate Std.E       t     p | Estimate Std.E       t     p | Estimate Std.E       t     p
Duration   F-S1    | ↑  0.331 0.026  12.518 0.000 | ↑  0.097 0.027   3.573 0.000 | ↓ -0.051 0.022  -2.316 0.021 | ↓ -0.178 0.027  -6.647 0.000
Duration   F-S2    | ↑  0.198 0.026   7.501 0.000 | ↑  0.355 0.027  13.139 0.000 |    0.020 0.022   0.897 0.370 | ↓ -0.143 0.027  -5.342 0.000
Duration   F-S3    | ↑  0.066 0.026   2.517 0.012 |    0.053 0.027   1.963 0.050 | ↑  0.347 0.022  15.719 0.000 | ↑  0.028 0.027   1.033 0.302
Duration   F-S4    |    0.039 0.026   1.481 0.139 |    0.009 0.027   0.345 0.730 | ↑  0.065 0.022   2.958 0.003 | ↑  0.310 0.027  11.614 0.000
Intensity  F-S1    | ↑  0.031 0.006   5.512 0.000 | ↑  0.030 0.005   5.558 0.000 | ↓ -0.042 0.007  -6.464 0.000 | ↓ -0.080 0.007 -10.812 0.000
Intensity  F-S2    | ↑  0.015 0.006   2.731 0.006 | ↑  0.036 0.005   6.803 0.000 | ↓ -0.033 0.007  -5.066 0.000 | ↓ -0.072 0.007  -9.645 0.000
Intensity  F-S3    |    0.001 0.006   0.187 0.852 |    0.003 0.005   0.473 0.636 | ↑  0.024 0.007   3.634 0.000 |   -0.007 0.007  -0.893 0.372
Intensity  F-S4    |    0.007 0.006   1.221 0.222 |    0.002 0.005   0.410 0.682 | ↑  0.016 0.007   2.511 0.012 | ↑  0.038 0.007   5.115 0.000

In the non-focused condition (see Figure 7.3), results showed the
following: (1) With regard to the rhyme duration, the durations of S3 and S4
were significantly longer than that of S1 (S3: Estimate = 0.101, S.E. = 0.033,
t = 3.024, p = 0.002; S4: Estimate = 0.084, S.E. = 0.033, t = 2.525, p = 0.012),
and there was no significant difference between S2 and S1. Furthermore,
the rhyme duration of T3[LH] was significantly longer than that of T1[HL]
(Estimate = 0.078, S.E. = 0.033, t = 2.352, p = 0.019). (2) With regard to the
mean intensity, the intensity of S2 was significantly smaller than that of S1
(Estimate = -​0.024, S.E. = 0.007, t = -​3.392, p = 0.001), while the intensity of
S3 was significantly larger than that of S1 (Estimate = 0.015, S.E. = 0.007,
t = 2.108, p = 0.035). Furthermore, in general, the intensity of T3[LH] was significantly
smaller than that of T1[HL] (Estimate = -0.044, S.E. = 0.007, t = -6.191,
p < 0.001). However, there was a significant interaction of syllable position ×
tone type. In S2 position, the intensity of T3[LH] was significantly larger than
that of T1[HL] (Estimate = 0.086, S.E. = 0.010, t = 8.562, p < 0.001). In contrast,
in S4 position, the intensity of T3[LH] was still significantly smaller
than that of T1[HL], and the magnitude of the difference was even larger than in the
other positions (Estimate = -0.024, S.E. = 0.010, t = -2.336, p = 0.020).
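The chapter does not say how the p-values were derived from the model t-values. As a rough plausibility check, and purely as an assumption made here, a two-sided normal approximation reproduces the reported values closely; the function name is hypothetical.

```python
import math


def two_sided_p_normal(t_value):
    """Two-sided p-value for a t-value under a standard normal approximation.

    Assumption: with this many observations the t distribution is close to
    normal. This is a sanity check, not the authors' documented procedure.
    """
    return math.erfc(abs(t_value) / math.sqrt(2.0))


# t-values reported in the text for rhyme duration in the non-focused condition.
for label, t_value in [("S3 vs S1", 3.024), ("S4 vs S1", 2.525), ("T3 vs T1", 2.352)]:
    print(label, round(two_sided_p_normal(t_value), 3))
```

Small discrepancies in the third decimal place are to be expected, since the reported t-values are themselves rounded and the exact reference distribution is unknown.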
With regard to the effect of contrastive focus on the rhyme duration (see
Table 7.6), results showed the following: (1) When the contrastive focus was
located on S1 or S2 (in F-​S1 or F-​S2 conditions), the rhyme duration of both
S1 and S2 was significantly lengthened, while the duration of S4 was signifi-
cantly shortened. It should be noted that in the F-​S1 condition, the rhyme
duration of S3 was also significantly shortened. (2) When the contrastive focus
was located on S3 or S4 (in the F-S3 or F-S4 conditions), the rhyme duration
of S3 and S4 was significantly lengthened, and the focused syllable was
lengthened more. It should be noted that the rhyme durations
of S1 and S2 were also lengthened but not significantly, except for S1 in
the F-S3 condition.
With regard to the effect of contrastive focus on the mean intensity (also
see Table 7.6), results showed the following: (1) When the contrastive focus
was located on S1 or S2 (in the F-​S1 or F-​S2 conditions), the mean intensity
of both S1 and S2 was significantly increased, while the intensity of S3 and
S4 was significantly reduced. (2) When the contrastive focus was located on
S3 (in F-​S3 condition), the intensity of S3 was significantly increased, and the
intensity of S4 was decreased but not significantly. (3) When the focus was
on S4 (in F-S4 condition), the intensity of both S3 and S4 was significantly
increased, and the increase was larger for S4 than for
S3. In both F-S3 and F-S4 conditions, there was no significant change in
the intensity of S1 and S2.
In summary, the duration and intensity patterns in the non-focused condition
suggest that compound words and VP phrases have different prosodic
structures. The rhyme duration of S3 and S4 is significantly longer than
that of S1, and the mean intensity of S3 is significantly larger than that of
S1, which indicates that S3 is at a higher level of the prosodic hierarchy than S1, because
the strength of the boundary is closely correlated with the pre-boundary
lengthening and intensity of the material preceding the boundary (Wagner
and Watson 2010).
With regard to the effects of contrastive focus, it significantly increased the
intensity of the focused syllable, and significantly decreased the intensity of
the post-​focus syllables, while leaving that of pre-​focus words largely intact.
However, there is one exception. When the contrastive focus was located on
S4 (i.e., the object), the intensity of S3 (i.e., the verb) was also significantly
increased, which is in line with the focus projection analysis proposed by
Selkirk (1995). She argued that an acoustic prominence on the head of a
phrase or its internal argument can project to the entire phrase, thus making
the entire phrase focused (see also Selkirk 1984; Gussenhoven 1983, 1999, for
a similar claim). With regard to the duration adjustment, although the con-
trastive focus was only located on one syllable (S1/​S2 or S3/​S4), the duration of
both syllables within one prosodic unit (e.g. S1 and S2, or S3 and S4) was sig-
nificantly lengthened, which reflects a spillover lengthening effect (the duration
lengthening of syllables outside the focus), that has been reported in many
languages (e.g., Cambier-​Langeveld and Turk 1999; Sluijter 1996 for English;
Heldner and Strangert 2001 for Swedish; Chen 2005 for Standard Mandarin).
Furthermore, there is one more matter to be noted. Generally, the duration
of T3[LH] is significantly longer than that of T1[HL], while the intensity
of T3[LH] is significantly smaller, which is in line with the results in Ling
and Liang (2016). T3[LH] in Shanghai Chinese is produced with breathy phonation,
in which one-third of the length of the vocal folds is open while they are
vibrating (Cho and Ladefoged 1999). As a result, considerable air escapes
through the opening during voicing, and in turn the intensity of the vowel
following breathy stops is smaller and the duration of the vowel is longer.
Interestingly, the intensity of T3[LH] in S2 position is significantly larger than
that of T1[HL], which indicates that T3[LH] in S2 position is no longer produced
with breathy phonation, but rather has become a real voiced sound with modal
phonation. This supports the proposition that a voiced consonant in Wu dialects
is a voiceless consonant followed by a vowel with breathy phonation when
it is at the beginning of a word, but a real voiced consonant with
modal phonation when it is in non-initial position (Chao 1967; Cao and
Maddieson 1992; Chen 2014). It should be noted that the intensity of T3[LH]
in S4 position is still significantly smaller than that of T1[HL], which indicates
that T3[LH] is still produced with breathy phonation in S4 position. In other words, it does
not change to voiced modal phonation because it is at the beginning of a
prosodic word, which reconfirms from another angle that S3 and S4 form a
prosodic phrase with a prosodic boundary in between.

[Figure 7.3: two sets of box plots, relative rhyme duration and relative mean intensity, each with four panels (S1, S2, S3, S4) comparing T1[HL] and T3[LH].]

Figure 7.3 Box plots of the rhyme duration (left) and mean intensity (right) of each
target syllable. The middle line represents the median, the box represents
the interquartile range (1st to 3rd quartile), and the whiskers represent
maximally 1.5 times the interquartile range

7.4 Discussion and conclusion

In this chapter, we first examined the f0, duration, and intensity patterns of
disyllabic compound words and VP phrases in Shanghai Chinese, and further
investigated the effects of contrastive focus on these patterns, in order
to further our understanding of the prosodic encoding mechanism of contrastive
focus in tonal languages, and to shed some light on the relationship
between focus encoding and prosodic structure in Shanghai Chinese. There
are several interesting findings to be noted.

7.4.1 The prosodic structure of compound words and VP phrases

The different f0 patterns of compound word and VP phrase confirmed that
in Shanghai Chinese, the compound word undertakes left-​dominant sandhi,
while VP phrase undertakes right-​dominant sandhi. In other words, the appli-
cation of left-​or right-​dominant sandhi is dependent on the morphosyntactic
structures (Xu and Tang 1988; Zhang and Meng 2016). Furthermore, a compound
word undertakes the left-dominant tone sandhi, which spreads the
tone contour of the initial syllable across the entire word. In other words, the
base tone contrast of the second syllable is neutralized, and the second syllable
accepts the tonal target assigned by the first syllable. In our study, the f0 pattern of the T1+T3
compound is realized as a high-level tone plus a high-falling tone (described as
[HH+HL]), and the f0 pattern of the T3+T1 compound is realized as a low-falling
tone plus a high-level tone (described as [LL+HH]), which is in line with the
literature (Zee and Maddieson 1980; Xu and Tang 1988; Zhu 1999). However,
there is some divergence from the literature with regard to the right-​dominant
sandhi. According to Xu and Tang (1988), a VP phrase should undertake the
right-​dominant sandhi, which retains the tone of the final syllable and levels
the contour of the non-​final tone. However, in our study, the tonal contours
of both syllables in a VP phrase were retained as high falling for T1[HL] and
low rising for T3[LH], without any leveling of the f0 contour of the first syllable
(S3). We only selected two tonal combinations (T1[HL]+T3[LH] and
T3[LH]+T1[HL]) in this study, and both are congruous
combinations, which might inhibit the leveling process of the first syllable.
If we chose incongruous tonal combinations, like T1[HL]+T1[HL] or
T3[LH]+T3[LH], would the f0 contour of the first syllable be realized in a more
level way, as recorded in the literature (Xu and Tang 1988)? However, Ling
and Liang (2017) investigated the f0 realization of 16 tonal combinations (4
base tones of σ1*4 base tones of σ2, except T1[HL]) of disyllabic VP phrases,
and found that only the f0 contour of T2[MH] of the first syllable was realized
as a high-​leveling tone, which is consistent with the literature, while the f0
contours of T3[LH] and T5[LHq] of the first syllable were realized as a low
rising tone, which is consistent with their citation tones. Therefore, according
to the existing acoustic evidence, the right-​dominant sandhi is more likely a
phonetic contour reduction caused by tonal coarticulation as proposed by
Takahashi (2013) and Zhang and Meng (2016), rather than a phonological
tone leveling process as proposed by Xu and Tang (1988) and Duanmu
(2005). All in all, compound words and VP phrases adopt different sandhi
patterns in Shanghai Chinese, which indicates the application of sandhi
patterns is dependent on the morphosyntactic structures. Furthermore, the
left-​dominant sandhi is a phonological rightward tone spreading, with the
base tone contrast of the second syllable neutralized and assigned a new tonal
target by the first syllable. However, the nature of the right-​dominant sandhi
is still unclear and requires further investigations with more morphosyntactic
structures and more tonal combinations.
Furthermore, the duration and intensity patterns of compound words and
VP phrases indicate that compound words and VP phrases have
different prosodic structures in Shanghai Chinese, which supports the analysis
of Selkirk and Shen (1990). Selkirk and Shen (1990) first proposed that
the application of left-​or right-​dominant sandhi is dependent on the prosodic
structure: a prosodic word, comprised of one prosodic domain, is supposed
to undertake left-​dominant sandhi, while a prosodic phrase, comprised of
two prosodic domains, is supposed to undertake right-​dominant sandhi. In
a non-focused condition, the rhyme duration of S3 and S4 is significantly
longer than that of S1, and the mean intensity of S3 is significantly larger
than that of S1. According to Wagner and Watson (2010), the strength of a
boundary is closely correlated with the pre-​boundary lengthening and inten-
sity of the material preceding the boundary. Since the rhyme duration and
mean intensity of S3 are significantly larger than those of S1, S3 should be at
a higher level of the prosodic hierarchy than S1. Moreover, contrastive focus induced
different adjustments of f0, duration, and intensity in compound words and
VP phrases, which reconfirms that a VP phrase is composed of a prosodic
phrase, while a compound word is composed of a prosodic word. A detailed
analysis is given in Section 7.4.2.

7.4.2 The effects of contrastive focus on f0, duration, and intensity patterns
Contrastive focus is prosodically encoded through the global adjustment of
f0, duration, and intensity of the whole sentence, which can be summarized
as “Tri-​zone adjustments”, as Xu (1999) proposed for the prosodic encoding
of focus in Standard Chinese. In Shanghai Chinese, the contrastive focus
also has little effect on the f0, duration, and intensity patterns of the
pre-​focus constituents, while it is phonetically realized mainly through
adjusting the f0, duration, and intensity patterns of focused and post-focus
constituents.
With regard to the focused constituents, a contrastive focus substantially
enhances the f0 realization of the focused syllable by (mainly) raising the
maxf0 and (sometimes) lowering the minf0; it also significantly increases the
intensity and lengthens the duration of the focused syllable. It should be noted
that the f0 and intensity of both syllables within the compound word (S1+S2)
are always enhanced or increased together, although the contrastive focus was
only located on one syllable, either on S1 or S2. In contrast, in VP phrases,
only the f0 and intensity of the focused syllable are enhanced or increased by
contrastive focus (in F-​S3 or F-​S4 conditions). In other words, the adjustment
domain of f0 and intensity in a compound word is the whole compound, while
that in a VP phrase is only the focused syllable, which verifies that there is a
prosodic boundary between the two syllables of the VP phrase, while there is
no boundary within the compound word. That is, a VP phrase is composed of
a prosodic phrase, while a compound word is composed of a prosodic word
(Selkirk and Shen 1990). Such a result also indicates that the focus-​induced
f0 and intensity adjustment are mediated through the prosodic structure in
Shanghai Chinese, which supports the indirect encoding analysis.
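
The domain asymmetry described above can be restated as a small sketch. It is illustrative only: the `Syllable` record, the prosodic-word indices, and the function name `enhancement_domain` are assumptions introduced here, not part of the original analysis. The grouping simply encodes the claim that S1 and S2 share one prosodic word, while S3 and S4 belong to separate prosodic words within one prosodic phrase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Syllable:
    label: str          # e.g. "S1"
    prosodic_word: int  # assumed index of the prosodic word containing the syllable


# Assumed grouping: compound S1+S2 = one prosodic word;
# VP phrase S3+S4 = two prosodic words inside one prosodic phrase.
SENTENCE = [
    Syllable("S1", 0),
    Syllable("S2", 0),
    Syllable("S3", 1),
    Syllable("S4", 2),
]


def enhancement_domain(syllables, focused_label):
    """Labels of the syllables predicted to show on-focus f0/intensity enhancement.

    Hypothetical rule: enhancement extends over the prosodic word that
    contains the focused syllable, and no further.
    """
    focused = next(s for s in syllables if s.label == focused_label)
    return [s.label for s in syllables if s.prosodic_word == focused.prosodic_word]


for condition in ("S1", "S2", "S3", "S4"):
    print(f"F-{condition}:", enhancement_domain(SENTENCE, condition))
```

Under these assumptions, focus on either S1 or S2 returns both syllables of the compound, whereas focus on S3 or S4 returns only the focused syllable. The sketch models f0 and intensity only; duration behaves differently, as the chapter goes on to show.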
However, the focus-induced duration adjustment pattern differs from
the adjustment patterns of f0 and intensity: not only were both syllables
of the compound word (i.e., S1 and S2) significantly
lengthened by contrastive focus (in F-S1 and F-S2 conditions), but
both syllables of the VP phrase (i.e., S3 and S4) were also
significantly lengthened by focus (in F-S3 and F-S4 conditions), although the
contrastive focus was located on only one syllable. It might seem, then, that the duration
adjustment is not linked to the prosodic structure. However, that is not true,
because in F-​S1 and F-​S2 conditions, there was no significant difference
between the lengthening magnitude of S1 and S2, while in F-​S3 and F-​S4
conditions, the lengthening magnitude was more significant on the focused
syllable. Such results indicate that the simultaneous lengthening of S1 and S2
is due to their being in one prosodic unit and, therefore, they are affected as a
whole unit by the contrastive focus. In contrast, the simultaneous lengthening
of S3 and S4 is due to the spillover lengthening effect (the duration lengthening
of syllables outside the focus), which has been reported in many languages
(e.g., Cambier-​Langeveld and Turk 1999; Sluijter 1995 for English; Heldner
and Strangert 2001 for Swedish; Chen 2005 for Standard Mandarin). That is
why the lengthening magnitude differs between S3 and S4.
With regard to the post-​focus constituents, generally speaking, the f0, dur-
ation, and intensity all were significantly reduced, which are signs of hypo-​
articulation. In other words, the f0 reduction reflects the weak implementation
of post-​focus lexical tones, which is due to the fact that they are associated with
prosodically non-​prominent constituents and are therefore hypo-​articulated,
as Chen (2010) proposed. However, there are several things to be noted. In
both F-S1 and F-S2 conditions, the f0 and intensity of S3 and S4 were significantly
lowered and reduced, which is a clear case of post-focus compression (PFC). However,
in the F-S1 condition, although S2 is the post-focus syllable, its f0, intensity, and
duration were not decreased but rather were enhanced together with those of S1, as
elaborated above. Therefore, PFC occurred only on S3 and S4, not on S2.
In the F-S3 condition, the f0 and intensity of S4 were also decreased, but its duration
was lengthened because of the spillover lengthening effect. These results
indicate that post-focus compression is also sensitive to prosodic structure
in Shanghai Chinese. PFC does not occur within a prosodic word (the S1+S2
compound) but only across a prosodic boundary, such as the phrase boundary
between S2 and S3 or the prosodic word boundary between S3 and S4.
Therefore, PFC is a prosodic boundary marker (at least
a prosodic word boundary) in Shanghai Chinese.
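
The boundary sensitivity of PFC can be sketched in the same spirit. The encoding below is hypothetical: the dictionary `PROSODIC_WORD` and the function `pfc_targets` are names introduced here for illustration, and the rule covers only f0 and intensity compression. It deliberately ignores duration (including spillover lengthening) and the tonal-context effects that the chapter reports for the T3[LH]+T1[HL] combination.

```python
# Assumed prosodic-word membership, listed in linear order (hypothetical encoding).
PROSODIC_WORD = {"S1": 0, "S2": 0, "S3": 1, "S4": 2}


def pfc_targets(prosodic_word, focused):
    """Post-focus syllables predicted to show f0/intensity compression.

    Hypothetical rule: a syllable after the focus is compressed only if a
    prosodic (word or phrase) boundary separates it from the focused syllable.
    """
    labels = list(prosodic_word)       # dicts preserve insertion order
    start = labels.index(focused) + 1  # first post-focus position
    return [
        label
        for label in labels[start:]
        if prosodic_word[label] != prosodic_word[focused]
    ]


for condition in PROSODIC_WORD:
    print(f"F-{condition}:", pfc_targets(PROSODIC_WORD, condition))
```

With focus on S1, the sketch skips S2 (same prosodic word) and returns S3 and S4; with focus on S3 it returns S4 only.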
Furthermore, when the contrastive focus was located on S3, there were
two situations. When the VP phrase is a T1[HL]+T3[LH] combination, the f0
and intensity of S4 were significantly reduced, that is, a PFC effect; however,
when it is a T3[LH]+T1[HL] combination, the f0 of S4 was not lowered but
rather raised, which is caused by the carryover effect of the significant raising
of the offset f0 of the preceding syllable. Therefore, the f0 range compres-
sion is not the only or primary characteristic of post-​focus tonal realization,
which is in line with the results of Chen (2010), who studied focus encoding in
Standard Chinese. She found that the specific f0 contours of post-​focus lexical
tones are subject to the influence of (mainly) preceding as well as (sometimes)
following tonal contexts. Post-​focus lexical tones are sometimes realized with
a compressed f0 range, but they can also be realized with a more expanded
f0 range than their pre-focus counterparts. Therefore, she proposed “the
observed post-​focus effects may be viewed as the multifaceted manifestations
of weak implementation of post-​focus tonal targets, as they are associated
with prosodically non-​prominent constituents” (Chen 2010: 517).
Last but not least, although the f0 and intensity of both S3 and S4 were
significantly reduced in both F-S1 and F-S2 conditions, the duration of S3
and S4 was adjusted differently in the two conditions. In the F-S1 condition,
the durations of both S3 and S4 were significantly reduced, whereas in the
F-S2 condition, only the duration of S4 was significantly reduced;
that of S3 was slightly increased, probably because of the spillover
lengthening effect. Such results indicate that the effect of focus
may not be purely structural, as proposed for non-tonal languages, in which
on-focus constituents are associated with pitch accents, while post-focal
constituents are deaccented (Selkirk 1996; Ladd 1996; Fery and Kugler
2008). Rather, focus-induced acoustic modification appears to be a more
gradient process.

7.4.3 The relationship between focus encoding and prosodic structure

According to the elaboration above, the focus-induced f0 and intensity
adjustments are different between compound words (S1+S2) and VP phrases
(S3+S4). In particular, the f0 and intensity of both syllables in the compound
word were enhanced or increased together, regardless of the focus position
(on S1 or S2). In contrast, in VP phrases, only the f0 and intensity of the
focused syllable were enhanced or increased by focus. Therefore, the domain
of f0 and intensity adjustment in compound words is the whole compound,
while that in VP phrases is only the focused syllable. The focus domain and
position are the same, but the domains of f0 and intensity adjustment are
different. The only explanation is that compounds and phrases have different
prosodic structures, and the focus encoding is mediated via prosodic structure
in Shanghai Chinese.
Furthermore, there is a clear post-​focus compression effect in Shanghai
Chinese, as the f0, intensity, and duration of post-​focus syllables were sig-
nificantly reduced, especially those acoustic cues of S3 and S4 in F-​S1 and
F-​S2 conditions. However, PFC does not happen within a prosodic word (i.e.,
between S1 and S2) but only after a prosodic boundary, that is, between S2
and S3 with a phrase boundary, or between S3 and S4 with a prosodic word
boundary. Therefore, the post-focus compression effect is also sensitive to prosodic
structure, and it functions as a prosodic boundary marker (at least a prosodic
word boundary) in Shanghai Chinese.
In summary, based on the focus-induced f0, duration, and intensity
adjustments both on-focus and post-focus, we propose that contrastive
focus is phonetically encoded through prosodic structure in Shanghai
Chinese, which offers empirical evidence for the indirect encoding analysis
(Truckenbrodt 1995; Selkirk 2006).

7.5 Conclusion
In this chapter, we examined the f0, duration, and intensity patterns of disyl-
labic compound words and VP phrases in Shanghai Chinese, and further
investigated the effects of contrastive focus on these patterns. The f0, duration,
and intensity patterns of modifier-noun compounds and verb-noun
phrases in the normal condition, together with their different adjustments under
contrastive focus, confirm that the application of left- or right-dominant
sandhi depends on morphosyntactic structure, which corresponds to
different prosodic structures. Furthermore, the prosodic encoding of contrastive
focus is mediated through prosodic structure in Shanghai Chinese.

References

Bates, D., Maechler, M., Bolker, B., and Walker, S. (2014). lme4: Linear mixed-​effects
models using Eigen and S4. R package version 1.1–​12.
Boersma, P., and Weenink, D. (2010). Praat: Doing phonetics by computer (Version
5.1.30) [Computer program]. Retrieved from
Bartels, C., and Kingston, J. (1994) “Salient pitch cues in the perception of contrastive
focus” The Journal of the Acoustical Society of America, 95(5), p. 2973
Baumann, S., Grice, M., and Steindamm, S. (2006) “Prosodic marking of focus
domains –​categorical or gradient?” in Proceedings of speech prosody, Dresden,
Germany, May 2–5, 2006, pp. 301–304. (
Beckman, M. E. (1986) Stress and non-​stress accent. Netherlands Phonetic Archives
Series No. 7. Dordrecht: Foris.
Birner, B. (1994) “Information status and word order: An analysis of English inver-
sion”, Language, 70(2), pp. 233–​259.
Breen, M., et al. (2010) “Acoustic correlates of information structure”, Language and
Cognitive Processes, 25(7), pp. 1044–​1098.
Cambier-​Langeveld, T., and Turk, A. (1999) “A cross-​linguistic study of accentual
lengthening: Dutch vs. English”, Journal of Phonetics, 27, pp. 255–​280.
Cao, J.-​F., and Maddieson, I. (1992) “An exploration of phonation types in Wu dialects
of Chinese”, Journal of Phonetics, 20, pp. 77–​92.
Chao, Y-​R. (1967) “Contrastive aspects of the Wu dialects”, Language, 43, pp. 92–​101.
Chen, M. (2000) Tone Sandhi. Cambridge: Cambridge University Press.
Chen, Y.-Y. (2005) “Durational adjustment under contrastive focus in Standard
Chinese”, Journal of Phonetics, 34, pp. 176–201.
Chen, Y.-Y. (2006) “Durational adjustment under corrective focus in Standard
Chinese”, Journal of Phonetics, 34, pp. 176–201.
Chen, Y.-​Y. (2010) “Post-​focus F0 compression –​Now you see it, now you don’t”,
Journal of Phonetics, 38, pp. 517–​525.
Chen, Y.-​Y., and Gussenhoven, C. (2008) “Emphasis and tonal implementation in
Standard Chinese”, Journal of Phonetics, 36, pp. 724–​746.
Chen, Y.-Y., and Gussenhoven, C. (2015) “Shanghai Chinese”, Journal of the
International Phonetic Association, 45, pp. 321–​337.
Chen, Z.-​M. (2014) “On the relationship between tones and initials of the dialects in
the Shanghai area.” Paper presented at the 4th International Symposium on Tonal
Aspects of Language (TAL-2014), Nijmegen, the Netherlands, 13–16 May 2014.
Cho, T., and Ladefoged, P. (1999) “Variation and universals in VOT: Evidence from 18
languages”, Journal of Phonetics, 27(2), pp. 207–​229.
Clark, E. V., and Clark, H. H. (1978) “Universals, relativity, and language pro-
cessing” in Greenberg, J. H. (ed.) Universals of human language, vol. 1. Stanford,
CA: Stanford University Press, pp. 225–​277.
Cooper, W. E., Eady, S. J., and Mueller, P. R. (1985) “Acoustical aspects of contrastive
stress in question–​answer contexts”, Journal of the Acoustical Society of America,
77, pp. 2142–​2156.
Cruttenden, A. (2006) “The deaccenting of given information: A cognitive universal”
in Bernini, G., and Schwarz, M. L. (eds.) The pragmatic organization of discourse in
the languages of Europe. Berlin: Mouton de Gruyter, pp. 311–356.
Duanmu, S. (2005). “The tone-​syntax interface in Chinese: Some recent controversies”,
in Kaji, S., Daigaku, T. G., and Kenkyūjo, A. A. G. B. Proceedings of Symposium
on Cross-Linguistic Studies of Tonal Phenomena. Tokyo: Research Institute for
Languages and Cultures of Asia and Africa (ILCAA), Tokyo University of Foreign
Studies, pp. 1–36.
Eady, S. J., and Cooper, W. E. (1986) “Speech intonation and focus location in matched
statements and questions”, Journal of the Acoustical Society of America, 80, pp.
Eady, S. J., Cooper, W. E., Klouda, G., Mueller, P., and Lotts, D. (1986) “Acoustical
characteristics of sentential focus: Narrow vs broad and single vs. dual focus envir-
onments”, Language and Speech, 29, pp. 233–​251.
Fery, C., and Kugler, F. (2008) “Pitch accent scaling on given, new and focused
constituents in German”, Journal of Phonetics, 36, pp. 680–​703.
Garding, E. (1987) “Speech act and tonal pattern in Standard Chinese”, Phonetica,
44, pp. 13–​29.
Gussenhoven, C. (1983) “Testing the reality of focus domains”, Language and Speech,
26, pp. 61–​80.
Gussenhoven, C. (1984) On the grammar and semantics of sentence accents.
Dordrecht: Foris.
Gussenhoven, C. (1999) “On the limits of focus projection in English” in Bosch, P., and
van der Sandt, R. (eds.) Focus: Linguistic, cognitive, and computational perspectives.
Cambridge: Cambridge University Press, pp. 43–​55.
Gussenhoven, C., Repp, B. H., Rietveld, A., Rump, H. H., and Terken, J. (1997) “The
Perceptual Prominence of Fundamental Frequency Peaks”, Journal of the Acoustical
Society of America, 102(5), pp. 3009–3022.
Heldner, M., and Strangert, E. (2001). “Temporal effects of focus in Swedish”, Journal
of Phonetics, 29, pp. 329–​361.
Ishihara, I., and Fery, C. (2006) “Phonetic correlates of second occurrence focus” in
Davis, C., Deal, A. R., and Zabbal, Y. (eds.) Proceedings of the 36th North-​Eastern
Linguistics Society (NELS 36) Amherst: GLSA (Graduate Linguistic Student
Association), Dept. of Linguistics, South College, University of Massachusetts,
pp. 371–​384.
Ishihara, S. (2011) “Japanese focus prosody rev prosodic phrasing”, Lingua, 121, pp.
Kochanski, G., Grabe, E., Coleman, J., and Rosner, B. (2005) “Loudness predicts
prominence: fundamental frequency lends little”, Journal of the Acoustical Society
of America, 118(2), pp. 1038–​1054.
Ladd, D. R. (1980) The structure of intonational meaning: Evidence from English.
Bloomington: Indiana University Linguistic Club.
Ladd, D. R. (1996) Intonational phonology. Cambridge Studies in Linguistics 79.
Cambridge: Cambridge University Press.
Ladd, D. R., and Morton, R. (1997) “The perception of intonational
emphasis: Continuous or categorical?”, Journal of Phonetics, 25, pp. 313–​342.
Lambrecht, K. (2001) “A framework for the analysis of cleft constructions”,
Linguistics, 39, pp. 463–​516.
Lieberman, P. (1960) “Some acoustic correlates of word stress in American English”,
Journal of the Acoustical Society of America, 32(4), pp. 451–​454.
Ling, B.-​J., and Liang, J. (2016) “The influence of syllable structure and prosodic
strengthening on consonant production in Shanghai Chinese”, in Lee, T. (ed.)
Proceedings of the 10th International Symposium on Chinese Spoken Language
Processing (ISCSLP 2016), 17–​20 October 2016, Piscataway, NJ: IEEE (Institute
of Electrical and Electronics Engineers), pp. 1–​5.
Ling, B.-​J., and Liang, J. (2017) “Focus encoding and prosodic structure in Shanghai
Chinese”, Journal of the Acoustical Society of America, 141(6), pp. 610–​616.
Pierrehumbert, J. B. (1980) The phonology and phonetics of English intonation.
Unpublished Ph.D. Diss., Massachusetts Institute of Technology.
Rietveld, A. C. M., and Gussenhoven, C. (1985) “On the relation between pitch excur-
sion size and prominence”, Journal of Phonetics, 13, pp. 299–​308.
Rietveld, T., and Chen, A.-​J. (2006) “How to obtain and process perceptual judgements
of intonational meaning”, in Sudhoff, S., Lenortová, D., Meyer, R., Pappert, S.,
Augurzky, P., Mleinek, I., Richter, N., and Schieβer, J. (eds.) Methods in empirical
prosody research. Berlin: Walter de Gruyter, pp. 283–​319.
R Core Team. (2016) R: A language and environment for statistical computing (version
3.3.2). Vienna, Austria: R Foundation for Statistical Computing.
Selkirk, E. (1984) Phonology and syntax: The relation between sound and structure.
Cambridge, MA: MIT Press.
Selkirk, E. (1995) “Sentence prosody: Intonation, stress, and phrasing” in Goldsmith,
J. (ed.) The handbook of phonological theory. Oxford: Blackwell, pp. 550–​569.
Selkirk, E. (1996) “Sentence prosody: Intonation, stress and phrasing” in Goldsmith,
J. (ed.) The handbook of phonological theory. Oxford: Blackwell, pp. 550–​569.
Selkirk, E. (2006) “Bengali intonation revisited: An optimality theoretic analysis in
which FOCUS stress prominence drives FOCUS phrasing” in Lee, C., Gordon,
M., and Buring, D. (eds.) Topic and focus: Cross-​linguistic perspectives on meaning
and intonation. Dordrecht: Springer, pp. 215–​244.
Selkirk, E., and Shen, T. (1990) “Prosodic domains in Shanghai Chinese” in Inkelas,
S., and Zec, D. (eds.) The phonology-​syntax connection. Chicago: University of
Chicago Press, pp. 313–​337.
Shih, C.-​L. (1988) “Tone and intonation in Mandarin”, Working Papers of the Cornell
Phonetics Laboratory, 3, pp. 83–​109.
Sluijter, A., and van Heuven, V. (1995) “Effects of focus distribution, pitch accent
and lexical stress on the temporal organization of syllables in Dutch”, Phonetica,
5, pp. 71–​89.
Sluijter, A., and van Heuven, V. (1996) “Spectral balance as an acoustic correlate of
linguistic stress”, Journal of the Acoustical Society of America, 100, pp. 2471–​2485.
Takahashi, Y. (2013) The phonological structure of Shanghai tone sandhi. Ph.D.
Diss., Tokyo University of Foreign Studies.
Terken, J. (1991) “Fundamental frequency and perceived prominence of accented
syllables”, Journal of the Acoustical Society of America, 89, pp. 1768–1776.
Truckenbrodt, H. (1995) Phonological phrases: Their relation to syntax, focus, and
prominence. Ph.D. Diss., Massachusetts Institute of Technology.
Turk, A., and Sawusch, J. (1996) “The processing of duration and intensity cues to
prominence”, Journal of the Acoustical Society of America, 99, pp. 3782–​3790.
Wagner, M., and Watson, D. G. (2010) “Experimental and theoretical advances in
prosody: A review”, Language and Cognitive Processes, 25, pp. 905–​945.
Xu, B.-​H., and Tang, Z.-​Z. (1988) Shanghai Shiqu Fangyan Zhi [A grammar of inner
city Shanghai]. Shanghai: Shanghai Jiaoyu Chubanshe (Shanghai Education
Xu, Y. (1999) “Effects of tone and focus on the formation and alignment of F0
contours”, Journal of Phonetics, 27, pp. 55–​105.
Xu, Y. (2005) “Speech melody as articulatorily implemented communicative
functions”, Speech Communication, 46, pp. 220–251.
Xu, Y. (2013) “ProsodyPro –​A tool for large-​scale systematic prosody analysis” in
Proceedings of Tools and Resources for the Analysis of Speech Prosody (TRASP
2013). Aix-​en-​Provence, France.
Xu, Y., and Xu, C. X. (2005) “Phonetic realization of focus in English declarative
intonation”, Journal of Phonetics, 33, pp. 159–​197.
Yip, M. (2002) Tone. Cambridge: Cambridge University Press.
Zhang, J., and Meng, Y-​L. (2016) “Structure-​dependent tone sandhi in real and nonce
disyllables in Shanghai Wu”, Journal of Phonetics, 54, pp. 169–​201.
Zhu, X.-​N. (1999) Shanghai tonetics. München: Lincom Europa.
Zee, E., and Maddieson, I. (1980) “Tones and tone sandhi in Shanghai: Phonetic evi-
dence and phonological analysis”, Glossa, 14, pp. 45–​88.

Part III

Interface between prosody and


What kinds of processes are postlexical?
And how powerful are they?
Ellen M. Kaisse1

8.1 Introduction
The literature on phonological processes that apply between content
words is full of cases like tone sandhi in Chinese languages, tone spread in
Bantu, the placement of intonational boundary tones, and local cases of
resyllabification, vowel deletion, voicing assimilation, or place assimilation
between the final segment of one word and the initial segment of the next.
But some kinds of processes are profoundly underrepresented. Vowel har-
mony rarely extends beyond the word or clitic group, and in the very few, less
familiar, cases reported to extend into the next full word, it often extends only
one syllable onward, not iterating so as to affect the whole word, as its lexical
counterparts do. Stress assignment almost always seems to be word bounded,
or clitic-​group bounded at the extreme. Similarly, processes of consonant
harmony –​the spread of a consonantal feature such as nasality, anteriority,
or pharyngealization –​are typically bounded by the word. In this chapter,
I survey the processes in the phonological literature that have been described
as applying across content words. I will then speculate on why postlexical
application is so strongly skewed toward certain kinds of processes and not
others. Because a survey of postlexical rules perforce must cover a great deal
of ground, I will concentrate here on the question of vowel harmony, com-
paring and contrasting it with processes involving tone, but will include some
discussion of the other kinds of cases mentioned above. I will not be looking
at processes that include closed class, function morphemes that fall outside
the morphological word but that arguably lie within the same prosodic word –​
hence my insistence here on operations between “full” or “content” words.
It is well known that function words, especially monosyllabic, closed-​class
items, can behave very much like affixes for the purposes of stress rules and
vowel harmony, among many others, so they are not my focus. But later in
this chapter we will speculate on why such closed-​class items can be available
to harmony and stress.
Much of the informal typology reported here comes from simple obser-
vation of the literature on postlexical processes, which I have followed
closely and to which I have contributed for several decades. Additionally,
A. Livingston and I reread the language descriptions in several
articles, books, and journal special issues devoted to interword phonology,
including Kaisse (1985), Nespor and Vogel (1986; hereafter N&V), Phonology
Yearbook 4 (1987), Inkelas and Zec (1990), Selkirk (2011), and Phonology
32.1 (2015). We catalogued about 60 cases. There were no cases of vowel or
consonant harmony or the first pass of footing (stress assignment) beyond the
prosodic word. There were many local processes, as detailed in the text, and
many processes involving tone. I then added to these cases those I encountered
when searching deliberately for harmony and footing processes between
words. The results reported here are sufficiently reliable, wide-​ranging, and
balanced only for an initial foray into the questions that the current article
addresses. A more thorough survey, based on more cases and balanced for
language family and other confounding factors, is still needed. My suspicion
is that it would largely uphold the generalizations reported here, but as this
kind of typological inquiry is new territory, to my knowledge, one cannot, of
course, be certain.
Postlexical processes (Kiparsky 1982 et seq.) are ones that can “see” into
the next word or words –​they are not word bounded. Some of these rules
are highly sensitive to the prosodic structure that is derived from syntactic
structure, while others seem to apply across the board, possibly sensitive
only to pausing. While it would be interesting to pursue the patterning of
these different kinds of processes, in this chapter I will make the simplifying
assumption that they are all of a piece. We will be concerned only with the type
of process (place assimilation, iterative vowel harmony, stress assignment,
etc.) and whether it typically extends between content words. Our conclu-
sion will be that the typical postlexical rule is strictly local, applying only
between adjacent segments –​that is, between the last segment of one word
and the first segment of the following word. Iterative rules like vowel har-
mony rarely iterate beyond the word boundary, and when they do, they usu-
ally only extend so as to affect one adjacent vowel in an abutting word. But,
in the words of Hyman (2011), tone is different. Pitch, as employed in both
tone and intonation, is particularly prone to long-​distance effects that extend
many syllables into an adjacent word. But we will see that even the postlexical
version of a tone rule can exhibit the same one-​vowel-​only restriction that we
observe in vowel harmony.
If these generalizations are true, why should they be so? We will specu-
late on a variety of reasons. For one thing, to be phonologized, a process
should have strong phonetic precursors. But iterative processes like vowel har-
mony have precursors that get weaker the further away the target lies from the
trigger. Another reason may involve the frequent collocation of words that
can lead to phonologization. While clitics, bound morphemes, and function
words appear over and over again with the same stems, independent content
words are not usually in frequent collocation with one another. A third reason
may be that neutralizing processes that operated regularly across whole words,
wiping out phonemic vowel contrasts that distinguish one content word from

What kinds of processes are postlexical? 229

another, might be disfavored. In a similar, information-​preserving vein, stress
assignment processes, with their property of assigning one, culminative main
stress, can be useful in marking the boundaries of words. If footing were to
regularly take in entire phrases, a good cue to the division between words
would be lost. On the other hand, tone is particularly prone to long-​distance
effects. Its exact position in a string can be hard to locate, and particularly in
Bantu languages, it bears a low functional load.

8.2 Typical postlexical processes

Most interword phonological processes are local: They apply between the last
segment of some word A and the first segment of the following word B.

(1) [word A]‿[word B]

A familiar example is the voicing assimilation that is often found between
the final consonant of some word A and the initial consonant of the next
word B, as in the voicing that affects /s/ in many dialects of Spanish (Dykstra
1955 a.o.).2

(2) doˈlores          doˈlores ˈkanta    doˈlorez ˈgana
    Dolores (name)    Dolores sings      Dolores wins

While the number of languages that exhibit this process is legion, we
can also mention Modern Greek (Arvaniti 1999 a.o.) for word-final /s/, and
Russian for obstruents in general (Jones and Ward 1969 a.o.). Greek and
Spanish have very few word-​final obstruents, while Russian has more; this is
probably the only reason the Greek and Spanish cases appear to be limited to
a single segment.
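
The strictly local character of the process in (2) can be sketched in code. This is a minimal illustration, not a full account of Spanish: the segment set `VOICED_CONSONANTS`, the function names, and the handling of the stress mark are simplifying assumptions made here.

```python
# Simplified, assumed inventory of voiced consonants that can begin a word.
VOICED_CONSONANTS = set("bdgmnɲlrɾʝ")
STRESS = "ˈ"


def first_segment(word):
    """First segment of a transcribed word, skipping an initial stress mark."""
    return word.lstrip(STRESS)[:1]


def voice_final_s(words):
    """Voice word-final /s/ to [z] before a word beginning with a voiced consonant."""
    out = list(words)
    for i in range(len(out) - 1):
        if out[i].endswith("s") and first_segment(out[i + 1]) in VOICED_CONSONANTS:
            out[i] = out[i][:-1] + "z"
    return out


print(" ".join(voice_final_s(["doˈlores", "ˈkanta"])))
print(" ".join(voice_final_s(["doˈlores", "ˈgana"])))
```

The rule inspects only the last segment of one word and the first segment of the next, which is the locality property at issue.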
Another commonly encountered interword process is resyllabification
between the final consonant of word A and the vowel-initial syllable of word B,
and any concomitant effects resulting from the newly syllable-initial (or perhaps
ambisyllabic) position of the consonant. Such processes are generally
thought to be due to the universal preference for syllables with onsets and for
syllables without codas. For instance, English optionally resyllabifies in a case
like the one in (3). If the resyllabified consonant is /t/ or /d/, it then flaps.

(3) pit.ə.ɹajvd → pi.tə.ɹajvd → pi.ɾə.ɹajvd
    Pete arrived
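
The two steps in (3) can be sketched as follows. This is only an illustration under stated assumptions: words are given as lists of syllable strings, the vowel set `VOWELS` is a rough stand-in, the function `link_words` is a name introduced here, and flapping is applied to the resyllabified consonant alone rather than in its full (and debated) environment.

```python
# Rough, assumed set of vowel symbols used to detect onsetless syllables.
VOWELS = set("aeiouəɪʊɛɔæʌ")


def link_words(word_a, word_b, flap=True):
    """Resyllabify a word-final consonant into a following onsetless syllable.

    word_a, word_b: lists of non-empty syllable strings.
    If flap is True, a resyllabified /t/ or /d/ surfaces as [ɾ].
    Returns the phrase as a dot-separated string of syllables.
    """
    a, b = list(word_a), list(word_b)
    final_syll, initial_syll = a[-1], b[0]
    if final_syll[-1] not in VOWELS and initial_syll[0] in VOWELS:
        moved = final_syll[-1]
        if flap and moved in ("t", "d"):
            moved = "ɾ"
        a[-1] = final_syll[:-1]         # coda consonant leaves word A
        b[0] = moved + initial_syll     # and becomes the onset of word B
    return ".".join(s for s in a + b if s)


pete, arrived = ["pit"], ["ə", "ɹajvd"]
print(link_words(pete, arrived, flap=False))  # resyllabification only
print(link_words(pete, arrived))              # resyllabification plus flapping
```

Keeping flapping as a switch on the resyllabified consonant mirrors the ordering in (3), where flapping follows from the consonant's new syllable-initial position.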

The exact description of the environment for flapping is, of course, the sub-
ject of continuing controversy, but the idea that it is related to resyllabification
is not. Similar examples can be found in dozens of languages, including
Spanish and French (despite the synchronically deeply complicated status of

Place assimilation is also commonly encountered. Nasals in Spanish are
well known to take on the place of articulation of a following consonant,
whether it is in the same word or the next one (Navarro Tomás 1965 a.o.), and
English consonants do the same, though the extent of this optional process,
which can sometimes include velar and labial consonants as targets as well as
alveolar ones, is only now being discovered through the investigation of large
corpora (Coleman et al. 2016).
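
Nasal place assimilation differs from voicing assimilation in that the target copies a feature value from the trigger, which a lookup table can approximate. The sketch below is hedged accordingly: the table `NASAL_BEFORE`, the function `assimilate_nasal`, and the example phrases are illustrative assumptions supplied here rather than data from the sources cited, and dental and palatal variants are collapsed for simplicity.

```python
# Assumed, simplified mapping from a following consonant to the nasal that
# shares its place of articulation (dental variants written as plain "n").
NASAL_BEFORE = {
    "p": "m", "b": "m", "m": "m",   # bilabial
    "f": "ɱ",                       # labiodental
    "t": "n", "d": "n",             # dental/alveolar
    "k": "ŋ", "g": "ŋ", "x": "ŋ",   # velar
}
STRESS = "ˈ"


def assimilate_nasal(words):
    """Give word-final /n/ the place of the next word's initial consonant."""
    out = list(words)
    for i in range(len(out) - 1):
        next_segment = out[i + 1].lstrip(STRESS)[:1]
        if out[i].endswith("n") and next_segment in NASAL_BEFORE:
            out[i] = out[i][:-1] + NASAL_BEFORE[next_segment]
    return out


for phrase in (["un", "ˈbeso"], ["un", "ˈgato"], ["un", "ˈoso"]):
    print(" ".join(assimilate_nasal(phrase)))
```

As with the voicing case, the rule looks no further than the two segments that meet at the word boundary.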
When one word ends in a vowel and the next word begins with one, vowel
deletion, gliding, or coalescence are often optional or obligatory, depending
on the language and dialect, the speech rate and style, and so on. Modern
Greek (Kaisse 1977 a.o.), Spanish (Navarro Tomás 1965 a.o.), and many other
languages have been thoroughly described in this regard.
Many languages exhibit the spread of nasality, continuancy, or voicing
after a nasal. Korean is famous for the first (Jun 1996 a.o.). Modern Greek
(Arvaniti 1999 a.o.) spreads voicing from a word-​final nasal onto a word-​
initial stop; it also spreads non-​continuancy from a word-​final nasal onto
a word-​initial voiced fricative. And voicing spread can be more general.
Korean (Jun 1998) voices unaspirated stops between sonorants, and this
assimilation is sensitive only to prosodic phrasing, not to the boundaries
between words. Postlexical continuancy spread is found in Spanish, if one
treats the alternation of voiced stops and fricatives as the result of spread
of +continuant from vowels and –​continuant from nasals, as does Harris
(1985), for instance.
Finally, we encounter an assortment of other, less common local segmental
processes, such as Korean post-​obstruent tensing (Jun 1998) or Italian syn-
tactic doubling (N&V 1986). And there is a small collection of rules that adjust
the length of vowels in response to their position in a phrase or the length of
other vowels, as exemplified in Chimwiini (Kisseberth and Abasheikh 1974),
Xitsonga (Cassimjee and Kisseberth 1998), and Luganda (Hyman et al.
1987). I will not treat these vowel length adjustments here. Some of them seem
related to rhythmic factors not unlike the rhythm rules we will mention later,
while others may be phonologizations of the cross-linguistic
tendency toward word-​final or phrase-​final lengthening.
Summarizing so far, the great majority of postlexical processes are
local assimilations of place, manner, or voicing, or they are deletions or
resyllabifications that serve to optimize syllable structure; both kinds apply
between adjacent segments. While some of them may be fully phonologized,
most are natural processes that apply with greater frequency in casual,
unmonitored speech. In other words, they have the typical markers ascribed to
postlexical rules within the theory of lexical phonology. They are phonetically
natural, optional, dependent on rate or style or both, and generally are not too
far from their phonetic precursors. To be sure, some, such as French Liaison or
Italian Syntactic Doubling, have become thoroughly grammaticalized and are
dependent on factors involving syntactic relations or the prosodic groupings
that are nearly isomorphic to syntax, and they may also make reference to

What kinds of processes are postlexical? 231

morphological categories, but such total grammaticalizations seem far less
common than processes that closely hug the phonetic ground.

8.3 Other types of local postlexical processes

While the kinds of very local processes listed in the previous section seem to
be the most frequently attested postlexical rules, there are certainly additional
types. The most commonly encountered of these are tonal processes, including
Chinese tone sandhis, and tone spread, deletions, and reassociations in various
tone languages, especially Bantu. We will discuss these further in Section 8.5.
At least some of these tonal processes are quite powerful. They can penetrate
more than one segment or syllable into an adjacent word or words. But what
of the other kinds of processes that we are used to encountering within words,
iterative ones that affect more than one segment? These include vowel and
consonant harmony (including the spread of coronality, nasality, pharyngeal-
ization, and other consonantal features), and stress assignment, which foots
entire words as exhaustively as possible. Can word-​internal long-​distance
processes such as these also apply in a long-​distance fashion between words?
At first blush, one might expect that they would, even more so than local
rules, since their iterativity within words makes them seem more powerful in
their domains. But in fact they are rare as postlexical processes, and when
they do escape the word, they usually extend no further than the closest syllable
of the adjacent word.

8.3.1 Vowel harmony

Most of the cases of vowel harmony described in the literature are word
bounded, or at least their descriptions do not mention any leakage of harmony
outside of the word. To illustrate how far beyond the morphologically
defined word a harmony rule generally goes, consider the extent of
vowel harmony in Turkish (Lewis 2000 a.o.) and in Pasiego (Penny 1969).
In Turkish, backness harmony takes in clitics, as shown in (4), where all the
vowels are back, and (5), where they are all front (examples mine).

(4) ʧoʤuk-​lar-​ɨm=lɑ=mɨ=dɨr
‘is it with my children?’

(5) ʧiʧek-​ler-​im=le=mi=dir
‘is it with my flowers?’

In Pasiego (a co-dialect of Spanish spoken in the northwestern section
of Spain), vowel harmony also takes in slightly more independent function
words, such as articles and object pronouns. Example (6) shows that Pasiego

232 Ellen M. Kaisse

height harmony spreads [+hi] from a stressed vowel backward to underlying
mid vowels. In (7) we see that the preposition /​po/​‘along’ and the determiner
/​el/​appear with mid vowels before a content word whose stressed vowel is
not high. However, in (8), we can see that both these function words undergo
raising before the stressed high vowel of [kɐˈmɩnʊ] ‘path’. (Low vowels are
transparent to harmony.)

(6) beˈber bibiˈria /​beb-​/​

drink-​INF drink-​COND

(7) po la ˈkale el ˈpan /​po/​, /​el/​

along the street the bread

(8) pʊ ɩl kɐˈmɩnʊ
along the path

In these two languages, harmony does not apply between the members of a
compound word, let alone between independent content words, and this gener-
ally seems to be the case in those languages I have surveyed. However, Tibetan
(Dawson 1980) is reported to have productive Advanced Tongue Root (ATR)
height harmony within compounds. My speculation, which will find further
support in the independent word cases below, is that because compounds are
fixed, lexicalized phrases –​some more so than others, of course –​and there-
fore contain words that are in frequent collocation, they provide the next most
hospitable environment for vowel harmony to be phonologized.
An interesting case of a statistical tendency toward vowel harmony within
compounds in Turkish was discovered by Martin (2007). More Turkish
compounds have harmony-​obeying members than would be expected from a
chance distribution. Based on this case and several others, Martin hypothesizes
that when speakers retrieve words (including compound words), there is a
statistical preference to retrieve those that accord with the phonotactic
generalizations present in the language. Therefore, more Turkish compounds
that accord with vowel harmony are retrieved and subsequently lexicalized.
Such distributions can only be found easily with tools earlier generations
of linguists did not have, since counting compound words in the lexicon is
prohibitively time-consuming. In any case, it seems that these
generalizations about compounds are very rarely phonologized. We may find
many more statistically imbalanced cases like Martin’s, but I do not think we
are faced with significant underreporting of obligatory or even optional vowel
harmony within compounds in the world’s languages.

Akan, a representative case of limited postlexical harmony

So most vowel harmony seems to be restricted to the grammatical word plus,
possibly, adjacent clitics or function words. But are there any cases reported
of vowel harmony beyond the prosodic word? I have thus far encountered
only about half a dozen instances where the vocalic features of one content
word spread onto those of an adjacent and syntactically independent con-
tent word, as opposed to a member of a compound. In four of these cases,
the effect is felt only one syllable to the left, and is optional. Akan (Dolphyne
1988; Kügler 2015) will serve as a representative example. Akan is a Kwa lan-
guage of Ghana. It has a neutralizing, ATR-​spreading vowel harmony, which,
in addition to applying throughout a word, also occurs within a phonological
phrase between a word with [–​ATR] vowels followed by a word with [+ATR]
vowels. [+ATR] spreads regressively across the word boundary to the imme-
diately preceding syllable. The process is limited to the final syllable of the
[–​ATR] word and thus results in disharmonic words on the surface. In the
following example, the last vowel of /​ɔ̀pὲ/​‘s/​he likes’ is realized with +ATR [e]‌.

(9) ɔ̀pὲ sìká → ɔ̀pè sìká

s/he.likes money
‘s/he likes money.’ (Dolphyne 1988: 24)

Some characteristics to notice in this process, because they recur in other
cross-word vowel harmonies, are the following:

• The process is bounded by the phonological phrase (however that is
mapped in the particular language).
• The process is only regressive, irrespective of the bidirectionality of its
word-internal counterpart.
• The process can only affect one vowel, the one closest to the right edge
of the affected word, despite the iterative application found within words.
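
The three properties just listed can be made concrete with a small procedural sketch. The Python fragment below is an illustration only, not an analysis proposed in the sources cited. It assumes toneless, simplified transcriptions, a hypothetical table ADVANCE pairing [–ATR] vowels with [+ATR] counterparts, and a hypothetical function spread_regressive that treats a [+ATR] vowel in the first syllable of the following word as the trigger. Phrase boundaries are not modelled; the two words are simply assumed to share a phonological phrase.

```python
# Hypothetical [-ATR] -> [+ATR] pairs; a simplification for illustration only.
ADVANCE = {"ɪ": "i", "ʊ": "u", "ɛ": "e", "ɔ": "o"}
PLUS_ATR = set(ADVANCE.values())


def spread_regressive(word_a, word_b, max_syllables=1):
    """Advance the last `max_syllables` syllables of word_a (a list of
    syllable strings) if the first syllable of word_b has a [+ATR] vowel."""
    if not any(ch in PLUS_ATR for ch in word_b[0]):
        return list(word_a)  # no trigger: word_a surfaces unchanged
    out = list(word_a)
    stop = max(len(out) - 1 - max_syllables, -1)
    # Walk leftward from the right edge of word_a (regressive spreading).
    for i in range(len(out) - 1, stop, -1):
        out[i] = "".join(ADVANCE.get(ch, ch) for ch in out[i])
    return out


# Example (9), tones omitted: only the final vowel of the first word advances.
print(spread_regressive(["ɔ", "pɛ"], ["si", "ka"]))
```

With the default limit of one syllable the result is disharmonic, as the text describes for Akan; raising max_syllables mimics the more powerful, iterative behavior that is typical within words.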

While the process within Akan words is iterative and bidirectional (spreading
+ATR from stems to prefixes and suffixes), the one between words is local
and regressive. In other words, the postlexical process is less powerful than its
lexical counterpart.

Cases similar to Akan

A similar process occurs in another Kwa language, Gwa Nmle (Obenga 1995).
The domain is not a word but some larger phrasal unit, which Obenga does
not fully define. One of these phrasal units is subject plus verb. In this collo-
cation a [+ATR] vowel in the verb advances a [-​ATR] final vowel of the pre-
ceding subject noun, but, as in Akan, only one syllable in.
For another case where harmony extends only one syllable to the left in
cross-​word environments, consider Vata, a Kru language of Côte d’Ivoire
(Kaye 1982: 122, 126). In Vata, an ATR-​based harmony extends leftward
optionally into the final syllable of a content word. In (10) below, the monosyllabic
noun ‘food’ optionally has its vowel advanced under the influence of
the following [i]‌in [pì], while in (11), the disyllabic noun optionally undergoes
advancement of only its final vowel.3

(10) ɔ̋ ní   zɑ̄ ~ zʌ̄ pì

He NEG food cook

(11) ɔ̋ ní sɑ̋kɑ́ ~ sɑ̋kʌ́ (*sʌ̋kʌ́) pì

He NEG rice cook

This state of affairs is clearly reminiscent of Akan, where harmony extends
one syllable into a polysyllabic content word.

More long-distance processes in Nawuri and Kinande

We have now seen two languages of the Kwa family, and one of the Kru family,
where postlexical harmony shows the interesting one-​syllable-​leftward syndrome.
However, not all Kwa languages show this particular circumscribed behavior.
Nawuri, a language of Eastern Ghana, has unbounded regressive vowel har-
mony across word boundaries (Casali 2002: 25ff, also quoted in Kügler 2015).
In the Nawuri example below, all the syllables of the first word become [+ATR].

(12) /​ɛ-​kɔɔlɪɑ fulee /​  → [èkóólɑ̘ ́ ɑ̘ ́ fùléèʔ]

prog-​he.receive nc-​money ‘he is collecting money’

However, the interword process is not as regular or phonologized as its
intraword counterpart. Casali says it is variable and dependent on rate and
style, and assimilation is only partial, resulting in vowels of a quality inter-
mediate between the canonical +ATR and –​ATR phonemes. Note that this
process too is only regressive.
So it seems that cross-​word versions of vowel harmony can develop, but it
is difficult for them to fully grammaticize. They may remain gradient, partial,
and optional. I would speculate that the reason is that content words, unlike
affixes and function words, do not have the opportunity to be in frequent col-
location with other content words. (For discussion of the influence of repeti-
tion on grammaticization, see Bybee 2006 a.o.)
The Kinande case we now turn to underlines the connection between
repetition and grammaticization. In Kinande, ATR vowel harmony option-
ally extends between content words in noun-​plus-​adjective phrases except in
very deliberate speech. But, as is typical of Bantu languages, there are only
about 20 adjectives in the language, making adjectives a closed class and this
kind of collocation something akin to a fixed phrase that is often repeated.
Frequent collocation seems to favor postlexical application, in a way remin-
iscent of the more familiar extension to clitics and function words in cases
like Pasiego and Turkish. We will see that harmony may also occur between
a verb and its object, but it is slightly weaker even than in noun + adjective
combination, only affecting one syllable of the verb (Mutaka 1995; personal
communication 2015).
Kinande is a Bantu language of the Democratic Republic of the Congo
(Mutaka 1995; Archangeli and Pulleyblank 2002). ATR spreads obligatorily
within words. Harmony is bidirectional within words, but it goes only leftward
when it applies between words. The postlexical variant is optional and phon-
etically gradient both in the number of syllables affected and in the amount
the harmonized vowels are advanced. The most common combination in
which the interword extension is found is the strings of noun+adjective, which
are limited by the small number of adjectives in the language. In (13), we see
that the +ATR vowels of the adjective can spread their value backward not
at all, one syllable leftward into the last syllable of the noun, two syllables in,
or throughout the noun. Mutaka’s intuition is that the two-syllable spread,
[ɔmuti mukuhi], seems the most likely in connected, unguarded speech. But all
the renditions are grammatical.

(13) /ɔmʊtɪ mukuhi/ → [ɔmʊtɪ mukuhi] (deliberate) ~ [ɔmʊti mukuhi] ~ [ɔmuti mukuhi] ~ [omuti mukuhi]

tree short

Archangeli and Pulleyblank (2002) note that the output of the rule is also
gradient in the sense that the leftmost derived +ATR vowel is not quite as
advanced as a canonically +ATR vowel would be.
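
The range of renditions in (13) can be enumerated mechanically. The following Python sketch is offered only as an illustration, under assumptions of my own: the noun is given as a hand-syllabified list, ADVANCE is a hypothetical table of [–ATR]/[+ATR] vowel pairs, and spreading_variants is a hypothetical helper. It models only how many syllables are reached, not the phonetic gradience in the degree of advancement that Archangeli and Pulleyblank describe.

```python
# Hypothetical [-ATR] -> [+ATR] pairs; a simplification for illustration only.
ADVANCE = {"ɪ": "i", "ʊ": "u", "ɛ": "e", "ɔ": "o"}


def spreading_variants(noun):
    """Return the noun with 0, 1, 2, ... of its final syllables advanced,
    ordered from no spreading to spreading throughout the word."""
    variants = []
    for depth in range(len(noun) + 1):
        cut = len(noun) - depth
        advanced = [
            "".join(ADVANCE.get(ch, ch) for ch in syllable)
            for syllable in noun[cut:]
        ]
        variants.append("".join(noun[:cut] + advanced))
    return variants


# The noun of example (13), assumed here to syllabify as ɔ.mʊ.tɪ.
for form in spreading_variants(["ɔ", "mʊ", "tɪ"]):
    print(form, "mukuhi")
```

The four forms produced correspond to the four renditions in (13), from the deliberate pronunciation with no spreading to spreading throughout the noun.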
Mutaka (personal communication 2015) also intuits that the rule can
extend onto the nearest non-​low vowel in a verb when the following noun
object is +ATR. In this case the rule is again optional and would not apply
in very deliberate speech. The spread seems to be slightly more limited than
in the noun + adjective cases. Thus in (14), the /​ʊ/​of the verb stem /​sʊŋ/​ can
be realized as [suŋ] under the influence of the following +ATR noun, but no
advancement occurs on the preceding syllables [mɔ.tʊ.ka].4

(14) /​mɔ-​tʊ-​ka-​sʊŋ-​a         ɔ-​mu-​kali/​ → [mɔ́tʊkásuŋ omúkali]

TM-​1ppl-​TM-​find-​FV Aug-​C1-​woman   
‘we found the woman (a short time ago)’

So for a fourth time, in the combination of fully open-class content words,
we find regressive assimilation, optionality, penetration only one eligible vowel
into word A of the word A + B string, and gradience. The noun+adjective
collocation may be a bit ahead in the extension to content words and under-
goes a slightly more powerful version of this gradient process, but does not
seem to have been fully phonologized either. More investigation is clearly
called for beyond the little that Mutaka and I have done to push forward
the observations of Archangeli and Pulleyblank (2002), but the basic outline
seems to accord with the expectations one would form from the observation
of Akan, Gwa Nmle, and Vata.

Powerful interword vowel harmony?

While the infrequent collocation of content words should result in a typ-
ology where complete and grammaticized vowel harmony between open-​class
words is very rare, there is probably nothing that in principle prevents such
a situation from developing now and then. This appears to be nearly what
is going on in one dialect of Somali, though other dialects apparently have
less powerful harmony extensions. Andrzejewski (1955) reports that the Isaaq
dialect has ATR harmony across entire sentences so long as there is no pause.
But even this case still has the hallmarks of a postlexical version of a pro-
cess that has not been fully phonologized. He reports idiolectal variation and
more harmony at faster rates of speech or with fewer pauses while the within-​
word version is obligatory and regular. Moreover, Andrzejewski’s results have
not been found by all researchers for all dialects. They were not replicated by
Odden (1980), who looked explicitly but found no harmony between words,5
nor were they found by Armstrong (1934), who reports only some familiar
clitic-​like extensions to monosyllabic function words.
The Isaaq Somali case is, so far, the only nearly unambiguously long-​distance
and nearly grammaticized vowel harmony rule I have encountered. Before we
leave this section, however, I should dispose of a case that sounds, from the title
of the article describing it, like it might be an even more powerful instance of
long-​distance harmony than Isaaq Somali. This is Wolof, an Atlantic-​Congo
language of Senegal, described by Sy (2005) as having “ultra long-​distance
ATR agreement” in an article of the same name. However, what Sy describes
is not a phonological rule. In Wolof, a determiner agrees in ATR value with its
head, and an object clitic pronoun agrees in ATR value with its verb. Crucially,
they may be separated by a Noun Phrase (NP) or clause that disagrees in ATR
value with them. Evidently, this is morphologization of ATR as an agreement
marker. Presumably it stems from the familiar (phonological) extension of
word-​internal harmony to include adjacent function words and clitics. When
the closed-​class function words and clitics were separated from their hosts by
phrases, the ATR agreement was reinterpreted as an agreement marker.

Why so rare?

Why should a word-​bounded version of a process typically be powerful,
extending to all vowels in the word, but weak between words, rarely making
it out of the word and, if making it out, not remaining iterative but extending
only one syllable or so?
One reason may be that there is no robust long-distance phonetic precursor
for cross-word harmony. The pioneering work of Öhman (1966)
showed that many languages without phonologized harmony do have vowel-​
to-​vowel coarticulation. The effect peters out the further one gets from the
source (McPherson and Hayes 2016; Kimper 2011a). Therefore, learners will
not have good exemplars to suggest that the vowels of one word should agree
in features with another word –​they are often temporally too far away for any
strong articulatory carryover.6
This lack of a strong precursor will be magnified by an effect we have
already mentioned. Unlike bound morphemes or clitics or function words,
content words do not frequently co-​occur with other content words. So the
phonologization of any small effect becomes unlikely. There are just not
enough exemplars.7
A third, intertwined reason why we find so few documented cases
of interword vowel harmony may be underreporting. If vowel harmony
rarely phonologizes between content words and its effects are typically non-​
neutralizing, variable, and gradient in such cases, investigators may not direct
their attention to, notice, or mention interword harmony as often. Indeed, in
the history of phonological description, attention has long been directed primarily
to within-word processes. As we know from the theory of lexical phonology,
these processes tend to be neutralizing and regular (though they may
have scattered lexical exceptions). While attention to interword processes has
certainly increased in the last few decades, this imbalance has not been fully
redressed.
Finally, we have the sorts of explanations that might be best seen as
members of the Optimality Theoretic (OT) families of positional faithfulness
constraints (Beckman 2004) such as FaithRoot: root specifications
are resistant to change, and Faith-σ1: initial syllables are resistant to
change. Informally, this means that the information contained in a root or
in an initial syllable is particularly important and tends to be preserved. If
a vowel harmony process that was neutralizing extended all the way into a
content word, the identity both of its root vowel and its initial vowel would
be neutralized with the underlying vowels that differ from them in the har-
monizing feature.
Interesting work by Shih (2014, 2016) is reminiscent of these phonological
effects. Shih analyzes phonological influences on syntactic patterns such as
word order changes and suppletion. She argues that just as phonology can
have effects on morphological choices, such as allomorphy, it can increase
the likelihood of certain word order choices and paraphrases. However, the
effects that ultimately result in allomorphy within the word are much stronger
and more common than those that govern choices within phrases, and they
are grammaticized.
Given the number of cases I have thus far found, it may be premature to
speculate on why the regressive direction seems to predominate when har-
mony extends one syllable backward from one content word to another. But
it does seem like a gradient version of Faith-​σ1 may be at issue here. Beth
Hume (personal communication 2015) has pointed out to me that the final
syllable of a content word contains less information than its predecessor; it is
more predictable, based on the syllables that have come before. Therefore, if
any root vowel is to be affected without serious loss of information, it should
be one of the last ones. A neutralizing progressive vowel harmony from word
A to word B, which I have not found so far, would eliminate vocalic infor-
mation at the beginning of word B, information that is less predictable and
therefore more valuable.

8.4 Vowel harmony is typical of postlexical processes: Consonant harmony; stress

The behavior of postlexical vowel harmony leads one to wonder whether it
is representative of other long-​distance processes that iterate within words.
I believe that vowel harmony does indeed represent the norm.
Two other kinds of processes that iterate throughout words are consonant
harmony and the initial footing that determines primary and secondary
stresses within a word. These also tend very strongly to be word-bounded.
Hansson (2010) surveyed over one hundred cases of consonant harmony,
greatly extending the number of cases that had previously been brought
together in the literature. He includes any case where the features of a con-
sonant assimilate to another from which it is separated by a vowel, and
includes examples involving voicing, stricture, nasality, uvularity, secondary
articulations, and rhoticity. Yet he found no postlexical cases. This restriction
is all the more interesting because Hansson postulates that the precursor to
harmony is sentence planning and its associated errors, since they have many
traits in common. Sentence planning does extend beyond the word. We must
conclude that it takes more than a precursor to result in the phonologization
of a process so that it will apply between words. Just as vowel harmony has
vowel-​to-​vowel coarticulation as a precursor but rarely goes into an adjacent
word and then only one syllable, consonant harmony’s precursor does not
assure postlexical application.8
As with vowel harmony, it is possible that some of the absence of
documented interword cases comes from underreporting in the literature. For
instance, Sharon Hargus (personal communication 2015) finds that Sahaptin
sibilant harmony does sometimes apply outside the word. But it is vari-
able and there are not enough texts from the available speaker to nail down
what the precise description should be. Similarly, Martin (2007) found that
Navaho sibilant harmony is statistically overrepresented in compounds, just
as he found that vowel harmony is statistically overrepresented in Turkish
compounds. It is possible that if enough researchers went out looking for
postlexical consonant harmony or for fully phonologized harmony within
compound words in the world’s languages, some additional cases would be
unearthed –​and indeed I have certainly not done an exhaustive search for
such cases. One might guess, however, that, like vowel harmony examples, the
great majority would be variable and not completely phonologized.
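
The notion of statistical overrepresentation invoked here can be illustrated with a simple calculation. The Python sketch below is not Martin’s actual procedure, and the counts in it are invented purely for illustration. It assumes a two-way harmonic classification of compound members (for instance front versus back), takes the proportion p_front of members in one class as given, and compares the observed share of harmonic compounds with the share expected if the two members were combined independently. The function names are hypothetical.

```python
def chance_agreement(p_front):
    """Probability that two independently chosen members agree in a
    two-way harmonic class, given the proportion in one class."""
    return p_front ** 2 + (1 - p_front) ** 2


def overrepresentation(n_harmonic, n_compounds, p_front):
    """Ratio of the observed to the chance-expected rate of harmonic
    compounds; values above 1 indicate overrepresentation."""
    observed = n_harmonic / n_compounds
    return observed / chance_agreement(p_front)


# Invented counts, for illustration only: 60 harmonic compounds out of 100,
# with half of all compound members belonging to the front class.
ratio = overrepresentation(n_harmonic=60, n_compounds=100, p_front=0.5)
print(round(ratio, 2))
```

A ratio only slightly above 1 is exactly the kind of imbalance that is invisible to casual inspection of a lexicon and that requires corpus tools to detect.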

8.4.1 Stress: Foot building processes

A perusal of a compendious work on stress rules, such as Hayes (1995),
confirms the impression that the foot-building processes that initially assign
primary and secondary stresses are uniformly word-​bounded. While many
iterate so as to fully foot the word, they do not extend into the next content
word. They usually encompass a root and any cohering affixes, that is, affixes
that are phonologically interactive with the root. (Indeed, one typical diag-
nostic for cohering vs. non-​cohering affixes is that only the former affect the
syllable count on which stress assignment is based.) Of course, there are caveats
here. Languages do have rhythm rules, which take the basic prominences
assigned within a word and adjust them when words are put into phrases,
so as to avoid clashes of nearby primary stresses and achieve a pattern that
is closer to a perfect alternation of strong and weak. But these adjustments
presume an initial word-​bounded parse. They do not assign or move the main
stress of the phrase, place stress on a previously unstressed syllable or undo
footing. Another caveat is one we have already discussed for vowel harmony.
Sometimes closed-class function words or clitics may also be counted for the
purposes of stress; in some languages the stress domain may include not just the
morphological word but may also extend slightly into the (usually stressless)
more syntactically independent words around them. But the point remains
that they do not extend into other content words. Modern Greek provides a
good example of a typical almost-​word-​bounded stress system, showing that
clitics can be taken into the syllable count. It has a three-​syllable window for
stress, counting from the end of the word. If a noun with antepenultimate
stress is followed by a possessive pronoun, an additional, final stress is added,
suggesting cyclic stress that applies on the internal word domain and also on
the word plus clitic domain.

(15) to aftoˈkinito to aftoˌkiniˈto=mu

def automobile def automobile=1 poss
‘the automobile’ ‘my automobile’
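
The interaction between the three-syllable window and encliticization in (15) can be sketched as a short procedure. The Python fragment below is a simplified illustration, not a full account of Greek stress. The syllabification a.fto.ki.ni.to, the function name add_enclitic, and the decision to place the added stress on the final syllable of the host are assumptions made for the sketch, and the relative strength of the two stresses is not modelled.

```python
def add_enclitic(host, stress_index, clitic, window=3):
    """Attach an enclitic to a host word (a list of syllables) and return
    the enlarged group with the indices of its stressed syllables.

    If the lexical stress ends up more than `window` syllables from the
    end of the group, an additional stress is placed on the host's final
    syllable (an assumption based on example (15))."""
    group = host + [clitic]
    stresses = [stress_index]
    distance_from_end = len(group) - stress_index  # 1 = final syllable
    if distance_from_end > window:
        stresses.append(len(host) - 1)
    return group, stresses


# Example (15): antepenultimate stress on "ki", followed by the clitic "mu".
group, stresses = add_enclitic(["a", "fto", "ki", "ni", "to"], 2, "mu")
print(group, [group[i] for i in stresses])
```

A host with penultimate or final stress stays inside the window after cliticization, so the function leaves it with its single lexical stress.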

But word plus clitic seems to be about as big as the domain gets. Foot con-
struction does not cross content word boundaries. We don’t expect to find
cases like the fanciful examples below from a language like English, but where
trochaic feet are built from left to right, taking in whole phrases, and thereby
creating wholesale allomorphy in content words depending on the length and
stress pattern of surrounding content words:

(16) (x .) (x .) ‘other animals’

ˈʌðəɹ ˈænəməlz

(17) (x .) (x .) ‘big animals’

ˈbɪg ə ˈnɪməlz
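
The difference between the attested word-bounded parse and the fanciful phrase-wide parse in (16) and (17) can be made explicit in a few lines of Python. The sketch is illustrative only. It assumes hand-syllabified input, binary left-to-right trochees, and the convention (my own, for the sake of the example) that a monosyllabic word forms a degenerate foot by itself. It returns the positions of foot heads, that is, of stressed syllables, and does not model the accompanying changes in vowel quality.

```python
def foot_heads(syllables):
    """Indices of trochee heads when binary feet are built left to right.
    A trailing odd syllable is left unfooted; a lone monosyllable is
    treated as its own degenerate foot (an assumption of this sketch)."""
    if len(syllables) == 1:
        return [0]
    return list(range(0, len(syllables) - 1, 2))


def word_bounded(words):
    """Foot each word separately, as attested stress systems do."""
    heads, offset = [], 0
    for word in words:
        heads += [offset + i for i in foot_heads(word)]
        offset += len(word)
    return heads


def phrase_wide(words):
    """Foot the whole phrase as one string, ignoring word boundaries."""
    return foot_heads([syllable for word in words for syllable in word])


other_animals = [["ʌ", "ðəɹ"], ["æ", "nə", "məlz"]]
big_animals = [["bɪg"], ["æ", "nə", "məlz"]]
for phrase in (other_animals, big_animals):
    print(word_bounded(phrase), phrase_wide(phrase))
```

For ‘other animals’ the two parses coincide, but for ‘big animals’ the phrase-wide parse shifts the stress of the noun to its second syllable, which is the allomorphy that (17) is meant to dramatize.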

While the typical state of affairs is as we have sketched it here, there is an
exception, well-known among Slavicists, that proves the rule. Macedonian, as
described most compendiously by Franks (1987), can build feet over certain
units larger than the word. But the units, termed “enlarged stress domains”,
are those that are in frequent collocation. The situation is complicated and
also varies by geographical region, but generally the domains are nominal
modifiers and their head nouns, but only when those modifiers are found
in an idiomatic or “set” use, as in ‘old man’ or ‘dry grapes’ (in the sense of
‘raisins’); closed class nominal modifiers such as determiners, interrogatives
and possessive pronouns plus their head nouns; the closed class of numbers
plus a following noun; some prepositions (also a closed class) when they are
followed by pronouns; some prepositions and their nominal objects, but only
in idiomatic uses such as ‘(to) home’ or ‘by force’; and clitics. I discuss this
case in more detail in Kaisse (2017).
Our next question, then, is why is stress assignment almost always word-​
bounded? Here are my speculations. The first is one we have already advanced
for vowel and consonant harmony. Rhythm is grammaticized as alternating
stresses when certain syllables are heard together over and over. This really
only happens within words, including their affixes, and, occasionally, with
words plus slightly more independent items such as clitics. The second is par-
allel to the functional explanation we raised for vowel harmony. For harmony,
extending the copying of features into independent words would wipe out the
distinctive characteristics of root morphemes. In stress systems, extending the
building of feet as in our fanciful examples in (16) and (17) would eliminate
the demarcative function linguists often ascribe to stress. Because stress is
culminative, with exactly one primary stress syllable per word, it allows us to
identify the independent lexical words of a phrase. Because stress is often pre-
dictably located on an initial, final, or penultimate syllable, it also allows us
to identify word boundaries, helping the listener locate the end of one word
and the beginning of the next. I doubt that the absence of descriptions of foot
building algorithms that ignore word boundaries is due to underreporting.
In general, then, it appears that iterative rules do not iterate beyond the
word –​they are almost never postlexical and when they are, as in the case of
vowel harmony, they are weakly so, extending only optionally or gradiently
into an adjacent syllable, and very rarely into multiple syllables or multiple
words. The only true exception to this generalization seems to be tone rules,
but we will see in the next section that even the postlexical instantiations of
lexical tone rules can exhibit the same sort of single-​syllable restriction as we
have noted for vowel harmony. It appears that the precursors of phonological
rules are truly local. Becoming iterative requires phonologization of a process,
and phonologization requires frequent collocation of morphemes.

8.5 What is special about tone and pitch?

Tonal processes are widely represented, indeed over-​represented in the litera-
ture on postlexical phonology, particularly on phonology that extends many
syllables into an adjacent word or which can occasionally even span more
than two adjacent words. Intonational rules, which are conceptualized in most
modern treatments as processes that distribute non-lexical pitch autosegments,
are also frequently encountered in such literature. To make this concrete, let us
take a quick look at three major collections of papers about interword phon-
ology that have appeared in the last three decades. These collections have a
disproportionate number of contributions about tonal Bantu languages in par-
ticular, and tone languages in general. Many of the papers that are not about
tone are about intonation. The special issue of Phonology Yearbook 4: Syntactic
conditions on phonological rules (1987) contains eleven papers, of which two
deal with Bantu tone rules, and four more deal with tonal or intonational
phonology in non-​Bantu languages. Inkelas and Zec (1990) has no fewer than
seven out of nineteen papers that deal with tone in Bantu languages, plus two
on tone in Chinese. Finally, Phonology 32.1: Constituency in sentence phonology
(2015) contains six papers in total. Three are on tone in various Bantu
languages, one is about pitch reset in a Basque dialect with pitch accent, and
one is on downstep in the intonational system of German. The only one not
about pitch autosegments is the Kügler paper on Akan that we discussed
above. Pitch autosegments, be they used for lexical or morphological
tonal contrasts, for lexical pitch accents or for intonational melodies, appear
again and again when linguists talk about postlexical rules. Approaching the
question from the other direction, namely the characteristics of tone languages,
Hyman (2011 a.o.) documents many cases of tonal rules that are truly long dis-
tance, and many cases of tone rules used to mark morphological and syntactic
information in ways we simply do not find for segmental rules. In Hyman’s
amply documented view, tone really is different. While it can do everything that
segmental features can, it has properties that other features do not. “Tone is the
autosegment par excellence” (Hyman 2011: 238). By this, he means that tone is
particularly subject to mobility (not ending up where it starts out underlyingly)
and stability (failure to delete along with its tone-​bearing unit); that it is more
likely to float, unanchored to any particular tone bearing unit; to be employed
as a lexical or grammatical marker; and to interact with other tonal units at a
distance. It is therefore not surprising that we find tone (and intonation) dom-
inating collections dedicated to phonological processes that are sensitive to syn-
tactic constituency. Really, these are the only kinds of processes that involve
features mobile enough to regularly spread well into adjacent words and even
into non-​adjacent words. The spread is often bounded by prosodic phrases that
are isomorphic or nearly isomorphic with syntactic phrases. To anticipate a
case we will come back to later, consider a representative example from the
Bantu language Xitsonga (Cassimjee and Kisseberth 1998 a.o.). In (18), the
High tone from the verbal prefix /vá/ has spread over seven syllables, extending
through all but the phrase-final syllable of the second word.

(18) vá-​xava xihlambetwa:na → [váxává

3ppl-​buy pot
‘they are buying a pot’ Cassimjee and Kisseberth ex. (23)

242 Ellen M. Kaisse

David Odden (personal communication 2015) has suggested several
reasons why tone may act in this special way in Bantu. Tone bears a low func-
tional load in Bantu, which has relatively sparse underlying High tones and
only a two-​way contrast in tone; Low usually does not need to be marked.
Spread or displacement of High does not dislodge a contrastive tone, as it
does, for instance, in Chinese tone sandhi. There also may be something like
the opposite of the underreporting phenomenon we considered for vowel
and consonant harmonies. Long distance High tone spread has been noted
repeatedly in many foundational studies of particular Bantu languages, so it
is something Bantuists expect to find. Even when its language-​specific instan-
tiation is quite subtle, a Bantu phonologist is likely to discover her language’s
particular version of High Tone Spread. Finally, tone can be perceptually
rather hard to locate precisely in a Bantu word or phrase, allowing for the
development of grammatical rules for spread or for docking on tone-​bearing
units where it does not originate lexically. In an utterance with a series of
underlying Highs and not-​Highs, pitch transitions are often very gradual,
involving interpolation, downdrift, and so on – quite unlike the situation in
Asian tone languages, where rapid tonal transitions are exploited to make
distinctive contours. Instead, for example, an underlyingly toneless syllable
that might be expected to be realized at a low pitch can emerge at a high
pitch if it immediately follows a High. Additionally, the pitch difference
between distinctively high and low pitches may not be very great in many
Bantu languages.
Hyman (2011) leads us to realize that tone, as the “autosegment par excellence,”
may be specially suited for the delineation of syntactic boundaries (and
thus be frequently encountered in inter-word phonological processes). This
is probably also why it is used to mark boundaries in intonation. Languages
very often display phrase tones and boundary tones; pitch declination is often
limited to phrases and pitch reset typically occurs between the edge of one
prosodic constituent and the beginning of another. One cannot really imagine
a feature used for vowel harmony, say, used as an intonational contour.

8.5.1 Local postlexical tone spread

The immediately preceding discussion might have led the reader to believe
that the tonal processes that escape words are always powerfully long distance,
and indeed we will look at several more long-​distance cases shortly. But before
we turn back to these quintessential long distance postlexical tone rules, we
should note that there are also postlexical tone rules that are limited in their
domains, not unlike the vowel harmony cases we considered in section 8.3.1.

Copperbelt Bemba

In a series of papers, Bickmore and Kula (2013; Kula and Bickmore 2015)
have described the lexical and postlexical tonal phonology of Copperbelt

What kinds of processes are postlexical? 243

Bemba, a Bantu language of Zambia. The language has a process which they
name Unbounded H Spread. It applies to any number of toneless syllables
within a word that are preceded by a phrase-​final high tone, and makes them
all high.

(19) /​bá-​ka-​mu-​londolol-​a/​→ [bá-​ká-​mú-​lóóndólól-​á]

‘they will introduce him/​her’

The process does not generally cross word boundaries, as shown in (20).
Because the first word there is not phrase-final, Unbounded H Spread is
inapplicable, and a rule of Bounded Tone Spread instead spreads the high tone
exactly two tone-bearing units rightward.

(20) /​bá-​ka-​salul-​a buino/​→ [bá-​ká-​sálùl-​à bwììnò]

‘they will fry well’

However, there is a high tone spread rule which does apply between words.
Bickmore and Kula name this Inter-Word Doubling or Binary Spread.
It spreads a word-​final high tone onto the initial tone bearing unit of a
following word.

(21) /​páapaatik-​il-​á kapembuá/​→ [páápáàtìk-​ìl-​á kápèèmbwá]

‘flatten for Kapembwa’

My interpretation is that Inter-Word Doubling is a weak postlexical
version of Unbounded H Spread. Unbounded H Spread is iterative
and unbounded as far as the number of toneless syllables it can affect, but it
cannot escape the word. Inter-​Word Doubling, on the other hand, is remin-
iscent of the postlexical vowel harmony which permeates one syllable into a
content word in Akan and the other languages we looked at earlier.
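
The three spreading patterns in (19) to (21) can be sketched as operations on simplified tone strings. The following Python sketch is only an illustration, not Bickmore and Kula's analysis: each tone-bearing unit is reduced to "H" or "0" (toneless), the function names are hypothetical, and the choice between unbounded and bounded spread is keyed solely to whether the word is phrase-final. The interaction between the rules is not modeled.

```python
def spread_within_word(tones, phrase_final):
    """Spread the first H rightward within one word.

    Phrase-final word: spread to the end of the word (unbounded).
    Otherwise: spread onto at most two following toneless units (bounded).
    """
    out = list(tones)
    if "H" not in out:
        return out
    start = out.index("H")
    limit = len(out) if phrase_final else min(len(out), start + 3)
    for i in range(start + 1, limit):
        if out[i] != "0":  # stop at a unit that already bears a tone
            break
        out[i] = "H"
    return out


def inter_word_doubling(word1, word2):
    """Copy a word-final H onto the initial toneless unit of the next word."""
    w2 = list(word2)
    if word1 and w2 and word1[-1] == "H" and w2[0] == "0":
        w2[0] = "H"
    return w2


# (19): phrase-final word, a single H followed by toneless units
print(spread_within_word(["H", "0", "0", "0", "0", "0", "0"], phrase_final=True))
# (20): non-final first word, H spreads onto exactly two units
print(spread_within_word(["H", "0", "0", "0", "0"], phrase_final=False))
# (21): word-final H doubles onto the first unit of the following word
print(inter_word_doubling(["H", "H", "H", "0", "0", "0", "H"], ["0", "0", "0", "H"]))
```

Seen this way, Inter-Word Doubling differs from the within-word rules only in its reach: it affects a single unit across the word boundary.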
Notice that the tonal spreading in Copperbelt Bemba is rightward onto
the following word, while we have seen that the tendency in vowel harmony
in our sample is leftward. Hyman (2011, citing Kingston 2003) attributes the
predominance of perseverative tone spread within words to its phonetic pre-
cursor, the tendency of tones to be realized late within their segmental targets.
(However, he notes that between words, tonal spread can be found to act
regressively as well as progressively.)

Local Chinese tone sandhi

The Copperbelt Bemba case suggests that the tendency for postlexical
processes to be local and weak compared to their within-​word counterparts
applies to tonal processes as well as non-tonal ones. This notion is strongly
reinforced by the compendious description of tone sandhi processes in
Chinese dialects carried out by Chen (2000). Chen (pp. 101, 105ff) concludes
that Chinese tone sandhi, while very diverse, is typically restricted to a two-​
syllable stress foot. As in Copperbelt Bemba Inter-​word Doubling, tone in
such cases only spreads or influences one adjacent syllable, the weak sister in
the foot. Chinese tone sandhi processes may appear to be non-​local, but usu-
ally not in the sense that a single tone affects non-​adjacent tones. Rather, tone
sandhi can iterate locally. Once a string of two syntactically closely bound
words in a stress foot has been adjusted, the output may then feed into the
adjustment of another local sequence, as in Tianjin (Chen 2000: 105ff). The
algorithm for Tianjin is notoriously complicated, but it is clear that it works
on pairs of adjacent syllables, often feeding another process that then applies
to an overlapping pair of syllables. Thus, in the example in (22), a tonal
dissimilation process Chen calls True OCP changes the L on [wen] to Rising
under the influence of the following L on [bei]. This then feeds a process
dissimilating contours: in the newly created R R sequence on [bao wen],
the tone on [bao] changes to H.

(22)                   bao   wen   bei    ‘thermos cup’
     UR                R     L     L      Chen p. 107
     True OCP          R     R     L
     OCP on Contours   H     R     L

While more than one tone in a sequence has ultimately been changed, the
influence is only from one tone to an adjacent tone. And in many cases of
tone sandhi in the Chinese languages treated by Chen, the effect is purely local
within the foot, with no feeding as in the Tianjin case.
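
The derivation in (22) can be replayed mechanically. The sketch below implements only the two adjustments named in the text, applied once each in the order shown; it is not Chen's full algorithm for Tianjin, and the right-to-left scan is an assumption that makes no difference for this input.

```python
# Toy derivation for example (22); tone labels follow the text (R, L, H).
RULES = [
    ("True OCP", ("L", "L"), ("R", "L")),         # L L -> R L
    ("OCP on Contours", ("R", "R"), ("H", "R")),  # R R -> H R
]


def apply_rule(tones, target, output):
    """Rewrite the first adjacent pair matching `target`, scanning right to left."""
    tones = list(tones)
    for i in range(len(tones) - 2, -1, -1):
        if (tones[i], tones[i + 1]) == target:
            tones[i], tones[i + 1] = output
            break
    return tones


def derive(underlying):
    """Return the list of (stage name, tone string) pairs of the derivation."""
    steps = [("UR", list(underlying))]
    current = list(underlying)
    for name, target, output in RULES:
        current = apply_rule(current, target, output)
        steps.append((name, current))
    return steps


for stage, tones in derive(["R", "L", "L"]):  # bao wen bei
    print(f"{stage:16} {' '.join(tones)}")
```

Each rule inspects only two adjacent tones; the change on [bao] arises because the first rule creates the environment for the second.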
While, as we shall see in the next section, tonal processes can grammaticize
so as to span several syllables or even several words, Copperbelt Bemba and
many Chinese tone sandhi processes indicate that such powerful application
is not universal, nor perhaps even typical.

8.5.2 Long distance tone spread

As we have already noted, Bantu languages are particularly well-represented
in the literature on postlexical rules, especially in cases involving the tone
of one morpheme spreading throughout a word and on to another word
within a phonological phrase. The affected words are often polysyllabic content
words. One frequently cited example comes from Xitsonga, a Southern
Bantu language of South Africa. The language has High tones (marked) and
Low (unmarked) tones. H Spread (Cassimjee and Kisseberth 1998; Selkirk
2011) spreads the first High tone in a phrase rightward through the
phonological phrase, stopping short of the phrase-final syllable. Thus, in (23)
below, all the underlying Low tones in ‘bringing a giant’ appear low on the
surface following the low 1pl prefix [hi-]. However, in (24), the High-toned
3pl prefix [vá-] causes all of the subsequent tones except the last to become High.

(23) [hi-​tisa xi-​hontlovi:la]

1pl bring CL-​giant
‘We are bringing a giant’

(24) [vá-​tísá xí-​hóntlóví:la]

3pl bring CL-​giant
‘They are bringing a giant’
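
The contrast between (23) and (24) can be reproduced with a small sketch. It assumes one tone slot per syllable ("H", or "0" for unmarked Low) and, following the surface form in (24), exempts the phrase-final syllable from spreading; the function name is hypothetical.

```python
def h_spread(tones):
    """Spread the first H rightward up to, but not including, the phrase-final syllable."""
    out = list(tones)
    if "H" in out:
        for i in range(out.index("H") + 1, len(out) - 1):
            out[i] = "H"
    return out


low_prefix_phrase = ["0"] * 8           # as in (23): no High tone anywhere
high_prefix_phrase = ["H"] + ["0"] * 7  # as in (24): High on the prefix only

print(h_spread(low_prefix_phrase))
print(h_spread(high_prefix_phrase))
```

The word boundary plays no role in the function: only the edges of the phrase matter, which is what makes the process phrase-bounded rather than word-bounded.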

The many Bantu cases in the literature of which I am aware are phrase-bounded
– that is, they apply in a prosodic domain that is close to isomorphic
with a syntactic phrase: between a verb and its object for instance (Luganda
Low Tone Deletion, Hyman et al. 1987), any head and its complement
(Xitsonga), and so forth. As such, the examples one finds are typically only
two words long, though they may involve many syllables. Here are some of
the many Bantu examples one could mention: in Shambala, an Eastern Bantu
language of Tanzania (Philippson 1998), High tone spreads to the penult of
the following word, regardless of its length. In Tiriki, an Eastern Bantu
language of Kenya (Paster and Kim 2011), High tone spreads to all the toneless
tone-bearing units of the preceding word. In Logoori, yet another Eastern
Bantu language of Kenya (M. Paster and D. Odden personal communication
2015; fieldwork in progress), High tone spreads similarly to the way it spreads
in Xitsonga, but the phonetic repercussions in Logoori are not easy to hear
if one does not know to look for them. Here we have the inverse of under-​
reporting, in the sense that Bantuists know from comparative tonology within
the family that there is likely to be some reflex of tonal spread in the language
they are investigating and therefore are able to find subtle cases that a non-​
Bantuist might fail to recognize.
Turning to cases that are not from languages even distantly related to
Bantu, let us look at a low tone deletion rule in Peñoles Mixtec (Otomanguean;
Daly and Hyman 2007). This is a fairly spectacular case because it can involve
tonal triggers and targets in non-​adjacent words. Daly and Hyman argue that
Peñoles Mixtec has underlying L and H tones, while the third tone, often
treated as underlying Mid, is really a default Ø tone that is invisible to phono-
logical processes. The tone deletion rule, called OCP(L), deletes the second L
of a L-​Ø*-​L sequence (where Ø* indicates any number of toneless syllables)
sometimes across many words and many syllables. Daly and Hyman’s longest
example (their (13b); (25) below) shows twelve intervening toneless tone-bearing
units between the L’s. There are three words – including an inflected
verb – between the word containing the trigger L (the first syllable of [dìi-ni-kʷe-ʃi])
and the word containing the target L, [tʃìu]. The syllables containing
the trigger and target are underlined below.

(25) /​ɨɨ̀ Ndìi-​ni-​kʷe-​ʃi kada-​kʷe-​ʃi ɨɨN ɨɨN tʃìu /​→

[ɨɨ̀ Ndìi-​ni-​kʷe-​ʃi kada-​kʷe-​ʃi ɨɨN ɨɨN tʃiu]
one alone-​only-​pl-​she​pl-​she one one work
‘only one of them will do each of the jobs.’
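
The L-Ø*-L configuration can be made concrete with a short sketch. It is an illustration under simplifying assumptions rather than Daly and Hyman's formalization: tones are written "L", "H", or "0" (toneless Ø), the input covers only the span from the trigger L to the target L in (25), and the original trigger is assumed to remain active after a deletion.

```python
def ocp_l(tones):
    """Delete an L whose nearest preceding specified tone is L.

    Toneless units ("0") are skipped, so any number of them may
    intervene between the trigger L and the target L.
    """
    out = list(tones)
    last = None  # most recent specified (non-"0") tone still present
    for i, tone in enumerate(out):
        if tone == "0":
            continue
        if tone == "L" and last == "L":
            out[i] = "0"  # the target L is deleted; the trigger stays active
        else:
            last = tone
    return out


# Trigger L, twelve toneless units, target L, as counted in the text for (25)
span = ["L"] + ["0"] * 12 + ["L"]
print(ocp_l(span))
# An intervening H blocks the configuration
print(ocp_l(["L", "0", "H", "0", "L"]))
```

The function never counts the intervening units, which is the sense in which toneless syllables are invisible to the rule.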

Another, less dramatic example from a non-​Bantu language can be found in
the Nigerian Edoid language Etsako (Elimelech 1976). In Etsako, a sequence
of toneless syllables becomes High before a High tone, crossing word boundaries.
And in Lango, a Western Nilotic language of Uganda (Noonan 1992),
High tone spreads up to the next stressed syllable, ignoring word boundaries.
Chinese tone sandhi can also act at a distance, affecting a string of words.
A good example can be found in the work of Chan and Ren (1988) on Wuxi,
a northern Wu dialect. In a string of words within a phrase, a process that
they term Pattern Extension occurs. All the underlying tones after the first
word are wiped out. The (possibly complex) tone of the first word is spread
across the phrase. Example (26) below shows a case where the first word has
a rise-​fall (LHL) tone. (Chan and Ren do not provide the underlying tones
of the following words but they are content words that have tones of their
own in other contexts.) The first two autosegments (LH) of the first word’s
tone attach to the first syllable of the phrase, the last tone (L) attaches to the
last syllable of the phrase, and all the syllables that have lost their own tones
receive a spread tone from the second (middle) H autosegment.

(26) di      vəq    tçhi   le
     LHL                           non-initial tones wiped out
     LH      H      H      HL      LHL associated across words
     bring   not    up     hither
     ‘unable to bring up’
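
The association in (26) can be sketched as a mapping from a three-tone melody onto a phrase. The function below is hypothetical and covers only melodies of exactly three autosegments, such as the LHL of (26), on phrases of at least two syllables; how other melodies or single-syllable phrases behave is not addressed in the text and is not modeled.

```python
def pattern_extension(melody, n_syllables):
    """Associate a three-tone melody of the first word across a phrase."""
    if len(melody) != 3 or n_syllables < 2:
        raise ValueError("sketch covers three-tone melodies on 2+ syllables only")
    first, middle, last = melody
    tones = [first + middle]               # first two autosegments on syllable 1
    tones += [middle] * (n_syllables - 2)  # middle tone spreads to medial syllables
    tones.append(middle + last)            # final syllable: spread tone plus last tone
    return tones


# (26): di vəq tçhi le, with the first word carrying LHL
print(pattern_extension("LHL", 4))
```

Only the first word's melody is consulted; the underlying tones of the later words never enter the computation, mirroring the statement that they are wiped out.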

8.6 Conclusion
Almost any kind of local process can apply between words. Adjacent
consonants can undergo assimilation across a word boundary, adjacent
vowels can undergo deletion or gliding, and final consonants can be recruited
as onsets to the next, vowel-​initial word. This exuberance of types is prob-
ably due to the fact that most phonologized processes start life as natural
local effects and these effects are not sensitive to grammatical information
but rather to temporal adjacency (Kiparsky 1982 et seq.). Apparently, it is
not difficult for such effects to be grammaticized, though they may remain
optional in the sense that they are most likely to apply in rapid, unguarded,
or informal speech. Iterative processes, such as vowel harmony, consonant
harmony, and metrical stress assignment, however, are more phonologized.
Most vowel harmony rules are word-​bounded, though ‘word’ may be defined
phonologically rather than morphologically, so that syntactically independent
function words and clitics may be included in their domain. I ascribed the
rarity of postlexical vowel harmony to a variety of factors. The phonetic
precursors for iteration become weaker the farther one gets from the trigger
vowel; content words are not in frequent collocation with one another so
there are few exemplars to lead to phonologization of these weak effects;
and, from an information-sparing perspective, fully assimilating a contrastive
feature of all the vowels in a content word would neutralize a great deal of
lexical information. Consonant harmony and iterative foot-​formation, a.k.a.
stress assignment, show similar limitation to within-​word domains, probably
for some of the same reasons.
However, there does seem to be a syndrome involving weak postlexical
vowel harmony, which has been noted piecemeal by researchers for the par-
ticular language they analyzed and which I brought together here. We saw
four cases (Akan, Vata, Gwa Nmle, and Kinande) where harmony can extend
backwards to affect a single, eligible vowel in a content word A when the
following content word B contains a trigger. We also saw slightly more long-​
distance but optional and gradient effects in Kinande and Nawuri, where
more than one syllable in an adjacent word might be affected but their
vowels were not neutralized with a fully +ATR vowel as they would be in a
phonologized process. The only fully grammaticized iterative harmony pro-
cess we found occurring between content words was the long-​distance ATR
spread reported for Isaaq Somali. The Kinande case also brought up another
point. Sometimes content words, such as the few lexical adjectives in Bantu
or pronominal adjectives in fixed phrases in Macedonian stress domains, can
approach the postlexical, phonologically interactive behavior of function
words. Because fixed phrases and closed classes lead to frequent collocation,
this behavior accords with the view of phonologization I am advocating here.
To summarize, the effect of iterative non-​tonal processes almost always
ends when the edge of the word is reached, or it fades to nothing one syllable
into the adjoining word. But tone is special. There are many cases, in Bantu,
Chinese, Mixtec, and other tonal languages, where a single underlying tone can
spread its effects to tone-​bearing units many syllables away, and occasion-
ally even to non-​adjacent words within the same phonological phrase. Yet
the case of Copperbelt Bemba shows that the postlexical instantiations of
tone rules can also exhibit the same sort of single-​syllable restriction that
we have noted for vowel harmony. Most Chinese tone sandhi is not very
powerful either: It only affects an adjacent syllable within the same foot.
We can conclude that the phonetic processes that give rise to phonological
rules are local, for the most part, or, as in the case of vowel-​to-​vowel coar-
ticulation, their longer-​distance effects fall off rapidly with distance from the
trigger. Becoming iterative requires phonologization of a long-​distance pro-
cess, and phonologization is facilitated in the frequent collocations of roots
with closed class affixes.

1 I am grateful to Hamed Al-​Tairi, Ryan Bennett, Gunnar Hansson, Sharon Hargus,
Beth Hume, Larry Hyman, Nancy Kula, Andrew Livingston, Dan McCloy, Laura
McGarrity, Philip Mutaka, Andrew Nevins, David Odden, Douglas Pulleyblank,
Stephanie Shih, Richard Wright, and audiences at the Annual Meeting on
Phonology (University of British Columbia, Vancouver 2015) and the First
International Conference on Prosodic Studies: Challenges and Prospects (Tianjin
Normal University, Tianjin 2015). Special thanks to Hongming Zhang for organ-
izing the latter conference and for the invitation that gave rise to my thinking about
the issues treated in this chapter.
2 Because I will refer to many phenomena and many languages in this relatively brief
paper, references to languages will usually pick out just one or two particularly
accessible or relevant descriptions.
3 Kimper (2011b) reads Kaye as saying that several monosyllabic content words can
be included in a single domain, but Kaye’s examples are ambiguous. In the example
below, the only one I find in Kaye where two words become +ATR, there are two
+ATR triggers, each of which affects one preceding monosyllable. Triggers and
targets are bolded.

/​n̋ kɑ́ sū zɔ̋ zi̋ zò/​→ [n̋ kʌ́‿sū ző‿zi̋ zò] Kaye p 123
I FUT tree under hide ‘I will hide under a tree’

4 Only high vowels are distinctively +ATR and –​ATR in Kinande. Other vowels are
underlyingly –​ATR only.
5 Odden does not mention the dialect of her speaker.
6 Barnes (2006) cites Inkelas et al. (2001) for evidence that phonetic vowel-​to-​vowel
coarticulation is problematic as a simple, unaided source for vowel harmony.
Inkelas et al.’s argument comes from Turkish, where anticipatory phonetic effects
are stronger than perseveratory ones, but the phonologized harmony system is
perseveratory. He instead attributes the phonologization of vowel harmony to
vowel-​to-​vowel coarticulation coupled with lengthening of the trigger syllable and
paradigm uniformity effects allowing longer-​distance effects on distant affixes.
7 A particularly nice instantiation of this collocation effect can be found in the work
of Côté (2013) on liaison in Laurentian French. Côté uses transition probabilities
between various invariant words (such as adverbs and prepositions) and following
parts of speech to predict the likelihood that a liaison consonant will appear
between the first word and the second word.
8 Hamed Al-Tairi (personal communication 2015) informs me that Bukshaisha (1985,
cited in Habis (1998)) reports that in Qatari Arabic, emphasis spread, a type of
consonant-​to-​vowel or consonant-​to-​consonant harmony, is not constrained to one
single word and can cross word boundaries. I have thus far been unable to obtain
a copy of Bukshaisha’s dissertation, but Habis’ examples and summary (p. 167ff)
suggest that the Qatari example is typically postlexical in showing a fading effect
on F2 from the triggering consonant to the emphasized vowel, extending about 600
msec. All the examples cited by Habis affect only the nearest vowel in an adjacent
word, similar to the postlexical vowel harmony cases considered in the previous
section, but Habis cites only monosyllabic target words. This case is clearly in need
of further consideration.

Andrzejewski, B. W. (1955) “The problem of vowel representation in the Isaaq dialect
of Somali”, Bulletin of the School of Oriental and African Studies, 17(3), pp. 567–580.
Archangeli, D., and Pulleyblank, D. (2002) “Kinande vowel harmony: Domains,
grounded conditions and one-sided alignment”, Phonology, 19(2), pp. 139–188.
Armstrong, L. E. (1934) The phonetic structure of Somali. Berlin: Mitteilungen des
Seminars für orientalische Sprachen.
Arvaniti, A. (1999) “Standard Modern Greek”, Journal of the International Phonetic
Association, 29(2), pp. 167–​172.
Barnes, J. (2006) Strength and weakness at the interface: Positional neutralization in
phonetics and phonology. Berlin: Mouton de Gruyter.
Beckman, J. (2004) “Positional faithfulness” in McCarthy, J. (ed.) Optimality theory in
phonology: A Reader. Oxford: Blackwell, pp. 310–​342.
Bickmore, L., and Kula, N. C. (2013) “Ternary spreading and the OCP in Copperbelt
Bemba”, Studies in African Linguistics, 42(2), pp. 101–​132.
Bukshaisha, F. (1985) An experimental study of some aspects of Qatari Arabic. Ph.D.
Diss., University of Edinburgh.
Bybee, J. (2006) “From usage to grammar: The mind’s response to repetition”,
Language, 82(4), pp. 529–​551.
Casali, R. F. (2002) “Nawuri ATR harmony in typological perspective”, Journal of
West African Languages, 29(1), pp. 3–​43.
Cassimjee, F., and Kisseberth, C. (1998) “Optimal domains theory and Bantu
tonology: A case study from Isixhosa and Shingasidja” in Hyman, L., and
Kisseberth, C. (eds.) Theoretical aspects of Bantu Tone. Stanford, CA: CSLI, pp.
Chan, M., and Ren, H. (1988) “Wuxi tone sandhi: From last to first syllable dominance”,
Acta Linguistica Hafniensia: International Journal of Linguistics, 21(2), pp. 35–​64.
Chen, M. (2000) Tone sandhi: Patterns across dialects. Cambridge: Cambridge
University Press.
Coleman, J., Renwick, M., and Temple, R. (2016) “Probabilistic underspecification in
nasal place assimilation”, Phonology, 33(3), pp. 425–​458.
Côté, M.-​H. (2013) “Understanding cohesion in French liaison”, Language Sciences,
39, pp. 156–​166.
Daly, P., and Hyman, L. (2007) “On the representation of tone in Peñoles Mixtec”,
IJAL, 73(2), pp. 165–​207.
Dawson, W. (1980) Tibetan phonology. Ph.D. Diss., University of Washington.
Dolphyne, F. A. (1988) The Akan (Twi-​Fante) language: Its sound systems and tonal
structure. Accra: Universities of Ghana Press.
Dykstra, G. (1955) Spectrographic analysis of Spanish sibilants and its relation
to Navarro’s physiological phonetic descriptions. Ph.D. Diss., University of
Elimelech, B. (1976) “A tonal grammar of Etsako”, UCLA Working Papers in
Phonetics, 35.
Franks, S. L. (1987) “Regular and irregular stress in Macedonian”, International
Journal of Slavic Linguistics and Poetics, 35–36, pp. 93–142.
Habis, A. (1998) Emphatic assimilation in Modern Standard Arabic: An experimental
approach to Quranic recitation. Ph.D. Diss., University of Edinburgh.
Hansson, G. Ó. (2010) Consonant harmony: Long-​distance interaction in phonology.
University of California Publications in Linguistics, 145. Berkeley: University of
California Press.
Harris, J. W. (1985) “Autosegmental phonology and liquid dissimilation in Havana
Spanish” in King, L. D., and Maley, C. A. (eds.) Selected papers from the 13th
Linguistic Symposium on Romance Languages. Amsterdam: John Benjamins, pp.
Hayes, B. (1995) Metrical stress theory. Chicago: University of Chicago Press.
Hyman, L. (2011) “Tone: Is it different?” in Goldsmith, J. et al. (eds.) Handbook of
phonology. Somerset, NJ: Wiley, pp. 197–​239.
Hyman, L., Katamba, F., and Walusimbi, L. (1987) “Luganda and the strict layer
hypothesis”, Phonology Yearbook, 4, pp. 87–​108.
Inkelas, S., Barnes, J., Good, J., Kavitskaya, D., Orgun, O., Sprouse, R., and Yu,
A. (2001) “Stress and vowel-​to-​vowel coarticulation in Turkish”. Presented at
the annual meeting of the Linguistic Society of America, Washington, DC,
January 2001.
Inkelas, S., and Zec, D. (1990) The phonology-​syntax connection. Chicago: University
of Chicago Press.
Jones, D., and Ward, D. (1969) The phonetics of Russian. Cambridge: Cambridge
University Press.
Jun, S.-​A. (1996) The phonetics and phonology of Korean prosody: Intonational phon-
ology and prosodic structure. New York: Garland.
Jun, S.-​A. (1998) “The accentual phrase in Korean phonology”, Phonology, 15(2), pp.
Kaisse, E. M. (1977) Hiatus in Modern Greek. Ph.D. Diss., Harvard University,
Cambridge, MA.
Kaisse, E. M. (1985) Connected speech: The interaction of syntax and phonology.
Orlando: Academic Press.
Kaisse, E. M. (2017) “The domain of stress assignment: Word-​boundedness and
frequent collocation” in Bowern, C., et al. (eds.) On looking into words (and
beyond): Structures, relations, analyses. Berlin: Language Science Press, pp. 17–​40.
Kaye, J. (1982) Vata harmony. MS, Université de Québec à Montréal.
Kimper, W. (2011a) Competing triggers: Transparency and opacity in vowel harmony.
Ph.D. Diss., University of Massachusetts, Amherst.
Kimper, W. (2011b) “Domain specificity and Vata ATR spreading”, Proceedings of the
34th Annual Penn Linguistics Colloquium, 17, pp. 155–​164.
Kingston, J. (2003) “Mechanisms of tone reversal”, in Kaji, S. (ed.) Cross-​linguistic
studies of tonal phenomena. Tokyo: ILCAA, pp. 57–​120.
Kiparsky, P. (1982) “Lexical phonology and morphology” in Linguistic Society of
Korea (ed.) Linguistics in the morning calm. Seoul: Hanshin, pp. 3–​91.
Kisseberth, C., and Abasheikh, M. (1974) “Vowel length in Chi-​mwi:ni: A case study
of the role of grammar in phonology” in Bruck, A., and Chicago Linguistic Society
(eds.) CLS Parasession on Natural Phonology. Chicago: University of Chicago
Press, pp. 193–​209.
Kügler, F. (2015) “Phonological phrasing and ATR vowel harmony in Akan”,
Phonology, 32(1), pp. 177–​204.
Kula, N., and Bickmore, L. (2015) “Phrasal phonology in Copperbelt Bemba”,
Phonology, 32(1), pp. 147–​176.
Lewis, G. (2000) Turkish grammar. 2nd edition. Oxford: Oxford University Press.
Martin, A. T. (2007) The evolving lexicon. Ph.D. Diss., University of California, Los
McPherson, L., and Hayes, B. (2016) “Relating application frequency to morpho-
logical structure: The case of Tommo So vowel harmony”, Phonology, 33(1), pp.
Mutaka, N. (1995) “Vowel harmony in Kinande”, Journal of West African Languages,
25(2), pp. 41–55.
Navarro Tomás, T. (1965) Manual de pronunciación española. Madrid: Consejo
Superior de Investigaciones Científicas.
Nespor, M., and Vogel, I. (1986) Prosodic phonology. Dordrecht: Foris.
Noonan, M. P. (1992) A grammar of Lango. Berlin: Mouton de Gruyter.
Obenga, S. (1995) “Vowel harmony in Gwa Nmle”, Afrikanistische Arbeitspapiere, 41,
pp. 143–​152.
Odden, M. (1980) Somali vowel harmony. Unpublished MS, University of Illinois.
Öhman, S. (1966) “Coarticulation in VCV utterances: Spectrographic measurements”,
JASA, 39(1), pp. 151–​168.
Paster, M., and Kim, Y. (2011) “Downstep in Tiriki”, Linguistic Discovery, 9(1), pp.
Penney, R. (1969) El habla pasiega. London: Tamesis.
Philippson, G. (1998) “Tone reduction vs. metrical attraction in the evolution of
Eastern Bantu tone systems” in Hyman, L., and Kisseberth, C. W. (eds.) Theoretical
aspects of Bantu tone. Stanford, CA: CSLI, pp. 315–​329.
Phonology 32(1): Constituency in Sentence Phonology. (2015) Ed. by Selkirk, E. O.,
and Lee, S.
Phonology Yearbook 4: Syntactic Influences on Phonology. (1987) Ed. by Kaisse, E. M.,
and Zwicky, A.
Selkirk, E. O. (2011) “The syntax-​phonology interface”, in Goldsmith, J., et al. (eds.)
The handbook of phonological theory. 2nd edition. Oxford: Wiley-​Blackwell, pp.
Shih, S. (2014) Towards optimal rhythm. Ph.D. Diss., Stanford University.
Shih, S. (2016) “Phonological influences in syntactic alternations” in Gribanova, V.,
and Shih, S. (eds.) The morphosyntax-​phonology connection: Locality and direction-
ality at the interface. Oxford: Oxford University Press, pp. 223–​252.
Sy, M. (2005) “Ultra long-​distance ATR agreement in Wolof ”, BLS, 31, pp. 95–​106.
Yip, M. (2002) Tone. Cambridge: Cambridge University Press.

Match Theory and prosodic
well-formedness constraints
Junko Ito and Armin Mester

9.1 Introduction1
Several strands of work in prosodic theory have recently converged around a
number of common themes, from different directions. Selkirk (2009) (see also
Elfner 2012) has developed a vastly simplified approach to the syntax-​prosody
mapping that distinguishes only three levels (word, phrase, and clause), and
syntactic constituents are systematically made to correspond to phonological
domains (Match Theory). In an independent line of research, a long string
of papers reaching back into the 1980s has convincingly demonstrated that
recursive structures are by no means an exclusive property of syntax, but also
play a crucial role in phonology. Even though at variance with strict layering
(Selkirk 1984; Nespor and Vogel 1986), the empirical existence of recursive
prosody is undeniable, as first demonstrated by Ladd (1986, 1988), whose
findings have been corroborated by Kubozono (1989, 1993), Schreuder and
Gilbers (2004), Gussenhoven (2005), Wagner (2005, 2010), Schreuder (2006),
Kabak and Revithiadou (2009), Ito and Mester (2009b), Féry (2010), and
van der Hulst (2010), to name a few, undermining a central tenet of orthodox
prosodic hierarchy theory that supposedly sets phonology apart from syntax.
Building on these empirical findings, Ito and Mester (2007, 2009a, 2013) have
gone on to argue that, beyond its sheer existence, prosodic recursion allows
for a vast, and much-​needed, simplification in the inventory of prosodic cat-
egories themselves. The empirically necessary subcategories that the data of
individual languages often seem to demand (such as the minor versus the
major phonological phrase of Japanese, long established under these names
since McCawley (1968), and rechristened into “accentual” versus “inter-
mediate phrase” by Pierrehumbert and Beckman (1988)) are not separate cat-
egories, each existing on its own in some (or all?) language(s), but are rather
instances of a single recursively deployed basic category. These results are
very much in harmony with central ideas in Match Theory, and recent work
(Selkirk 2011; Ishihara 2014) has successfully connected the two theories into
a larger framework.
One of the hallmarks of Match Theory is the idea that the main force interfering
with syntax-prosody isomorphism is not some kind of non-isomorphic
mapping algorithm flattening out the structure, as first contemplated in the
Sound Pattern of English (Chomsky and Halle 1968: 372) and more fully
worked out in later proposals, such as the edge-​based theory built on one-​
sided alignment (Selkirk 1986). It is rather the effect of genuine phonological
well-formedness constraints on prosodic structure: Whenever such constraints
dominate a Match constraint, prosodic structure is forced to diverge from
its syntactic model, often in significant ways. A large part of the explanation
of syntax-​prosody mismatches, then, lies in the precise content of these well-​
formedness constraints that conflict with the entailments of isomorphism.
We will here survey four types of constraints that interact with Match
constraints in this way: NoLapse (against tonal lapses, see Ito and Mester
2013), EqualSisters (against sister nodes in prosodic trees of different levels
in the prosodic hierarchy, see Myrberg 2013), and constraints enforcing
binary branching in prosodic trees, both minimally (BinMin) and maximally
(BinMax), which have been widely discussed in the literature.
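
The interaction described above, in which a well-formedness constraint that dominates Match forces a non-isomorphic parse, can be illustrated with a toy strict-domination evaluation. Everything in this sketch is hypothetical: the candidate labels, the violation counts, and the two rankings are invented for illustration and are not taken from the analyses surveyed here.

```python
def optimal(candidates, ranking):
    """Return the candidate with the best violation profile under strict domination.

    Profiles are compared constraint by constraint in ranking order,
    which is lexicographic comparison of the violation lists.
    """
    return min(
        candidates,
        key=lambda name: [candidates[name].get(c, 0) for c in ranking],
    )


# Hypothetical violation counts for two parses of the same input
candidates = {
    "isomorphic parse": {"Match": 0, "BinMin": 1},
    "non-isomorphic parse": {"Match": 1, "BinMin": 0},
}

print(optimal(candidates, ["Match", "BinMin"]))   # Match dominates
print(optimal(candidates, ["BinMin", "Match"]))   # BinMin dominates
```

Reversing the ranking is the only change between the two calls, so any divergence from the syntactic structure is attributable to the prosodic constraint rather than to the mapping itself.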
The goal of this chapter is to cast a critical eye on the way these kinds of
prosodic constraints interact with Match constraints in recent work in pros-
odic theory, our own included, and to isolate a number of issues that are in
need of deeper investigation. We restrict ourselves to work within optimality-​
theoretic phonology (OT; see Prince and Smolensky 2004), and will make
concrete proposals and conjectures at various junctures, identifying problem
areas along the way.
The primary data we will use as a testing ground are taken from an area
we are familiar with, the prosodic form of Tokyo Japanese utterances (henceforth
“Japanese”); we leave for future exploration the more detailed investigation
of other languages that might provide equally valuable or even better
cues. Our overall goal, however, is not to provide a comprehensive analysis of
Japanese but rather to subject a number of theories and ideas to an empirical
test. That being said, we hope to end up with some understanding
of the kinds of constraints that are necessary to explain a prosodic system.

9.2 Tonal antilapse constraints

The first type of constraint we will take up militates against tonal lapses: stretches
of low-​toned material exceeding a certain limit, typically at the ends of words
and phrases. In this section, we will first review relevant data from Japanese and
their analysis, and then turn to a comparison with Basque.

9.2.1 NoLapse in Japanese

Ito and Mester (2013) develop an analysis of the way Japanese utterances are
parsed into phonological phrases where NoLapse plays a central role in for-
cing the accentual fall to occur late in the word. A virtue of this approach is
that the orientation of the accent toward the end of the word is explained by
substantive tonal factors, not by stipulated formal right alignment.2 The facts
here have been well known since Kubozono (1989, 1993), and we illustrate
them with phrases consisting of two content words (after Vance 2008: 181).
The parses assigned to these examples by the theory proposed in Ito and
Mester 2013 appear in the second column in (1), where (1cd) crucially involve
recursive phrasing, as first recognized by Kubozono.

(1) S: P: Example: Gloss:

a. [XP [XP u] u] (φ u u) [[Hiroshima-​no] ‘Hiroshima fish and …’
b. [XP [XP u] a] (φ u a) [[Hiroshima-​no] ‘Hiroshima eggs and …’
c. [XP [XP a] a] (φ (φ a) (φ a)) [[Okáyama-​no] ‘Okayama eggs and …’
d. [XP [XP a] u] (φ (φ a) (φ u)) [[Okáyama-​no] ‘Okayama fish and …’

Here and in what follows, φ stands for “phonological phrase”, ω for
“phonological word”, a for “accented ω”, u for “unaccented ω”, and syntactic
and prosodic phrasing are labeled “S” and “P” and indicated by […] and (…),
respectively. The differences between these parses – flat phrasing in (1ab), recursive
phrasing in (1cd), but never exactly mirroring the syntax – are entirely due
to the locations of accented and unaccented words. The beginning of a phono-
logical phrase is in Japanese cued by a tonal rise. It is always possible, in careful
pronunciation, to parse each word as a separate φ, with its own initial rise, but
this is not the usual pattern. While two a’s are each parsed as a separate phrase
(because each accent has to be the head of a minimal phrase), u is typically
phrased together with an adjacent a or u (because one-​word phrases violate
binarity). This is where Kubozono (1993: 150–​154) discovered a directional
asymmetry: While two u’s are phrased together, u is only phrased together with
a following a, not with a preceding a. So the results are (1a) (uu) and (1c) ((a)
(a)), but (1d) ((a)(u)) with an initial rise at the beginning of the second word
and (1b) (ua) without such a rise. The data are subject to considerable vari-
ation. We focus here exclusively on the majority patterns.
The analysis in Ito and Mester (2013) involves the four constraints in (2).
Following Elfner (2012: 28), we sharpen the definition of matching in the
following way: In order for a phonological constituent p to match a syntactic
constituent s, p must exhaustively dominate all and only the phonological
exponents of the terminal nodes that s exhaustively dominates. As a result,
syntactic constituents introducing no overt terminal elements are invisible to
the matching process, and do not need to be separately matched.
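
The matching definition above can be made concrete by comparing word spans. The following Python sketch is only an illustration under our own assumptions, not part of the cited analyses: the function names (`parse`, `spans`, `match_xp_violations`) are hypothetical, words are abbreviated as `a` and `u`, and every bracket pair in the syntactic string is treated as one XP.

```python
def parse(s):
    """Turn a bracketed string such as '[[u]a]' or '((a)(u))' into nested lists."""
    stack = [[]]
    for ch in s:
        if ch in "([":
            stack.append([])
        elif ch in ")]":
            node = stack.pop()
            stack[-1].append(node)
        elif ch in "au":          # terminals: accented / unaccented word
            stack[-1].append(ch)
    return stack[0][0]


def spans(node, start=0):
    """Return (end, spans): the (start, end) word span of every constituent."""
    pos, found = start, []
    for child in node:
        if isinstance(child, str):
            pos += 1
        else:
            pos, inner = spans(child, pos)
            found.extend(inner)
    found.append((start, pos))
    return pos, found


def match_xp_violations(syntax, prosody):
    """Count overt syntactic constituents with no phrase over exactly the same words."""
    xps = spans(parse(syntax))[1]
    phis = set(spans(parse(prosody))[1])
    # Constituents without overt terminals (start == end) are invisible to matching.
    return sum(1 for (i, j) in xps if j > i and (i, j) not in phis)


print(match_xp_violations("[[u]a]", "(ua)"))       # the inner [u] has no matching phrase
print(match_xp_violations("[[a]u]", "((a)(u))"))   # both XPs are matched
```

Because matching is stated over terminals, an empty bracket pair such as the inner one in `[[]u]` contributes no violation in this sketch.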

(2) a. AccentAsHead: Every accent is the head of a minimal phrase φmin.3
       Assign one violation for each accent that is not the head of a φmin.
    b. NoLapse-L: No tonal lapses. Assign one violation for each fully
       L-toned ω in φ.
    c. MinimalBinarity-φ/ω: φ is minimally binary. Assign one violation for
       each φ that does not dominate at least two ω.
    d. Match-XP-to-φ: A phrase XP in syntactic constituent structure is
       matched by a corresponding phonological phrase φ in phonological
       representation.

Ranked as in tableau (3),4 these constraints derive the different parses in (1).
The syntactic pattern [[u]x], where x = a or u, is parsed as the non-​isomorphic
single φ (3ae) (ux), violating bottom-ranked Match-XP but satisfying higher-ranked
BinMin in the optimal way. However, [[a]x] is parsed as (3im) ((a)(x)),
violating BinMin: Isomorphic (3l) ((a)a) is out because the second a violates
AccentAsHead.



(3) Japanese with NoLapse

                         AccAsHead  NoLapse-L  BinMin  Match-XP
[[u]u]  a. ► (uu)                                      *
        b.   (u(u))                            *W      *
        c.   ((u)u)                            *W      L
        d.   ((u)(u))                          **W     L
[[u]a]  e. ► (ua)                                      *
        f.   (u(a))                            *W      *
        g.   ((u)a)      *W                    *W      L
        h.   ((u)(a))                          **W     L
[[a]a]  i. ► ((a)(a))                          **
        j.   (aa)        *W                    L       *W
        k.   (a(a))      *W                    *L      *W
        l.   ((a)a)      *W                    *L
[[a]u]  m. ► ((a)(u))                          **
        n.   (au)                   *W         L       *W
        o.   (a(u))      *W                    *L      *W
        p.   ((a)u)                 *W         *L
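
The evaluation in tableau (3) can be replayed mechanically. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, only the four bracketings listed in (3) are generated as candidates, and NoLapse-L is operationalized by our own simplification (an unaccented word counts as fully L-toned when it follows an accentual fall and is not initial in any φ).

```python
def parse(s):
    """Turn '((a)(u))' or '[[a]u]' into nested lists of 'a'/'u' words."""
    stack = [[]]
    for ch in s:
        if ch in "([":
            stack.append([])
        elif ch in ")]":
            node = stack.pop()
            stack[-1].append(node)
        elif ch in "au":
            stack[-1].append(ch)
    return stack[0][0]


def analyse(tree):
    """Collect per-word facts and the word span of every phrase."""
    words, spans, counter = [], [], [0]

    def walk(node):
        counter[0] += 1
        me, start = counter[0], len(words)
        minimal = all(isinstance(c, str) for c in node)   # dominates no other phrase
        for c in node:
            if isinstance(c, str):
                words.append({"kind": c, "phrase": me, "in_min": minimal, "initial": False})
            else:
                walk(c)
        words[start]["initial"] = True    # first word of a phrase carries the rise
        spans.append((start, len(words)))

    walk(tree)
    return words, spans


def violations(syntax, candidate):
    words, phis = analyse(parse(candidate))
    _, xps = analyse(parse(syntax))
    heads, acc = set(), 0
    for w in words:                       # AccentAsHead: one accent per minimal phrase
        if w["kind"] == "a":
            if w["in_min"] and w["phrase"] not in heads:
                heads.add(w["phrase"])
            else:
                acc += 1
    lapse, low = 0, False
    for w in words:                       # NoLapse-L (assumed operationalization)
        if w["initial"]:
            low = False                   # phrase-initial rise
        if w["kind"] == "u" and low:
            lapse += 1
        if w["kind"] == "a":
            low = True                    # accentual fall
    return {"AccAsHead": acc,
            "NoLapse-L": lapse,
            "BinMin": sum(1 for (i, j) in phis if j - i < 2),
            "Match-XP": sum(1 for xp in xps if xp not in phis)}


def optimal(x, y, ranking):
    """Pick the best of the four candidates of (3) for the input [[x]y]."""
    syntax = f"[[{x}]{y}]"
    candidates = [f"({x}{y})", f"({x}({y}))", f"(({x}){y})", f"(({x})({y}))"]
    return min(candidates, key=lambda c: tuple(violations(syntax, c)[k] for k in ranking))


JAPANESE = ["AccAsHead", "NoLapse-L", "BinMin", "Match-XP"]
for x, y in ["uu", "ua", "aa", "au"]:
    print(f"[[{x}]{y}] ->", optimal(x, y, JAPANESE))
```

Passing a ranking in which `BinMin` precedes `NoLapse-L` makes (au) optimal for [[a]u] in this sketch, the Basque-type outcome discussed in Section 9.2.2.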

The most interesting case is [[a]u] parsed as (3m) ((a)(u)), with a rise on u,
as depicted in (4d). The main tonal events in these examples are indicated with
schematic pitch contours.

(4) Schematic pitch contours (figure not reproduced)

a. =(3a) ( u u )
b. =(3e) ( u a )
c. =(3i) (( a )( a ))
d. =(3m) (( a )( u ))
   NoLapse-L fulfilled in the winning candidates (a-d), where no ω is fully L-toned.
e. =(3n) ( a u )
f. =(3p) (( a ) u )
   NoLapse-L violated in (e-f): the final u is fully L-toned after the accentual fall.

A fully L-toned ω arises after an accentual fall unless it is in its own φ
(thereby receiving the tonal rise on its own). The tonally low final u in the
losing candidates (3np) violates NoLapse-L. By contrast, the leading u in
(3ae) is tonally high, and does not violate NoLapse-L. In this analysis, the
directional asymmetry thus has an explanation rooted in the very shape of the
tonal melody of (Tokyo) Japanese (%LH- and %LH-H*L).

9.2.2 Comparison with Basque

Selkirk and Elordieta (2010) propose a slightly different analysis of Japanese
along similar lines, the chief difference being that the phrasing (au) is ruled out not
because the post-​accentual u is a tonal lapse, but because it separates the accentual
head a from the end of its phrase. Instead of NoLapse-​L, their analysis appeals
to the alignment constraint Align-​Right(φmin-​Head, φmin), which requires the
accented word a to be at the right edge of its φmin. Among the relevant candidates
in (6), both ((a)u) (6p) and ((a)(u)) (6m), but not (au), fulfill Align-​R. In order to
select (6m) over (6p), they introduce a further constraint in (5) into the analysis.

(5) EqualSisters: Sister nodes in prosodic structure are instantiations of the
same prosodic category.

Ranked above BinMin, EqualSisters5 (Myrberg 2013: 75) selects the
correct candidate ((a)(u)) in (6m).6



(6) Japanese with Align-R and EqualSis

[[u]‌u] a. ► (uu) *
b. (u(u)) *W *W *
c. ((u)u) *W *W L
d. ((u)(u)) **W L

[[u]‌a] e. ► (ua) *
f. (u(a)) *W *W
g. ((u)a) *W *W *W *W L
h. ((u)(a)) **W L
[[a]‌a] i. ► ((a)(a)) **
j. (aa) *W *W *L *W
k. (a(a)) *W *W *W L
l. ((a)a) *W *W *W *L
[[a]‌u] m. ► ((a)(u)) **
n. (au) *W L *W
o. (a(u)) *W *W *L *W
p. ((a)u) *W *L

Selkirk and Elordieta (2010) insightfully contrast the Japanese phrasings
in (1) with the minimally different phrasings of corresponding examples in
Northern Bizkaian Basque (henceforth “Basque”), which has a pitch accent
system tantalizingly close to that of Japanese, including a very similar word
melody (for an overview, see Gussenhoven 2004: 170–​184). The chief diffe-
rence is that the u of [[a]u], in Japanese parsed as its own phrase ((a)(u)), is in
Basque not parsed as a separate phrase. Selkirk and Elordieta (2010) express
this by assigning the parse ((a)u) (where nothing in the data seems to signal
the presence of the phrase (a), however).

(7) Japanese: [[a]u] → ((a)(u)) [EqualSisters >> BinMin]
    Basque:   [[a]u] → ((a)u)   [BinMin >> EqualSisters]

They derive the different outcomes in Japanese and Basque by the ranking
scenario in (7), with the result shown in (8).



(8) Basque with Align-R and EqualSis

[[u]‌u] a. ► (uu) *
b. (u(u)) *W *L *
c. ((u)u) *W *L W
d. ((u)(u)) **W W

[[u]‌a] e. ► (ua) *
f. (u(a)) *W *W *
g. ((u)a) *W *W *W *W L
h. ((u)(a)) **W L
[[a]‌a] i. ► ((a)(a)) **
j. (aa) *W *W L *W
k. (a(a)) *W *W *L *W *W
l. ((a)a) *W *W *L *W
[[a]‌u] m. ► ((a)u) * *
n. (au) *W L L *W
o. (a(u)) *W * * *W
p. ((a)(u)) **W L

In our own analysis with NoLapse instead of Align-​R and EqualSisters, the
Basque system emerges when NoLapse ranks below BinMin, as shown in (9).7



(9) Basque with NoLapse

[[u]‌u] a. ► (uu) *
b. (u(u)) *W *
c. ((u)u) *W L
d. ((u)(u)) **W L
[[u]‌a] e. ► (ua) *
f. (u(a)) *W *
g. ((u)a) *W *W L
h. ((u)(a)) **W L
[[a]‌a] i. ► ((a)(a)) **
j. (aa) *W L *L
k. ((a)a) *W *L
l. (a(a)) *W *L *W
[[a]‌u] m. ► (au) * *
n. (a(u)) *W L *
o. ((a)u) *W * L
p. ((a)(u)) **W L L

In Section 9.3, we look more closely at the workings of EqualSisters in
larger structures, and encounter some advantages for an analysis building on
a tonal NoLapse constraint.

9.3 Constraints on sister nodes

EqualSisters (Myrberg 2013) is a new member of the set of separate and
violable constraints that takes the place of traditional strict layering (Ito and
Mester 1992; Selkirk 1996): Daughter nodes are not required to be exactly
one level below their mother nodes in the hierarchy, but they need to be of
the same level.
An interesting problem arises when we consider three-​member syntactic
constituents of the form [[[x]y]z]. In order to keep things simple, we restrict
our attention to left-​branching structures, which conform to the basic pattern
of Japanese phrase structure (see Ito and Mester (2013) for right-​branching
structures). Complex noun phrases as in (10) are examples.

(10) a. [[[u]‌u]u] [NP [NP [NP amerika-​no] tomodachi-​no]    pasokon]

    America -​ GEN friend -​GEN   PC
    ‘(my) American friend’s PC’
b. [[[a]‌a]a] [NP [NP [NP isuráeru-​no] kurasuméeto-​no] rapputóppu]
    Israel -​ GEN classmate-​GEN laptop
    ‘(my) Israeli classmate’s laptop’

As we systematically build examples with all combinations of accented
and unaccented words in the three N-positions, we end up with the picture
in (11).

(11)   S: [[[1]2]3]    P:                Schematically:
   a.  [[[u]u]u]      ((uu) u)          ((1 2) 3)
   b.  [[[u]a]u]      ((ua) (u))        ((1 2) (3))
   c.  [[[u]a]a]      ((ua) (a))        ((1 2) (3))
   d.  [[[u]u]a]      ((uu) (a))        ((1 2) (3))
   e.  [[[a]a]a]      (((a)(a)) (a))    (((1)(2)) (3))
   f.  [[[a]a]u]      (((a)(a)) (u))    (((1)(2)) (3))
   g.  [[[a]u]a]      (((a)(u)) (a))    (((1)(2)) (3))
   h.  [[[a]u]u]      (((a)(u)) (u))    (((1)(2)) (3))

The eight syntactic structures are mapped to three different prosodic
parses, depending on the accentedness of each constituent word: The relatively
flat ((1 2) 3) is assigned only to [[[u]u]u] (11a); ((1 2) (3)), with a phrased
3, is assigned when 1 is u (except for [[[u]u]u]) (11b-d); (((1)(2)) (3)), where each
word is its own phrase, is assigned when 1 is a (11e-​h).
For an analysis based on the interplay of Align-​R, BinMin, and
EqualSisters, these outcomes present a conundrum, brought out clearly by
the comparative tableau (Prince 2000) in (12): In order for ((uu)u), without
any unary phrase, to win over the competing sister-​wise equal ((uu)(u)), we
need BinMin>>EqualSisters (12a). But in order for the sister-​wise equal
((ua)(u)) to win over the competing ((ua)u), which has one less unary phrase,
we need EqualSisters>>BinMin (12b).





(12)  Input        Winner        Loser         BinMin  EqualSisters
  a.  [[[u]u]u]    ((uu) u)      ((uu) (u))    W       L
  b.  [[[u]a]u]    ((ua) (u))    ((ua) u)      L       W

The ranking contradiction disappears once NoLapse-L takes the place of
Align-R, since it is able to differentiate between the two cases: ((↑ua↓) u),
but not ((↑uu) u), ends on a low u, as indicated by the pitch arrows, violating
NoLapse-L. This is shown in (13).





(13)  Input        Winner           Loser           NoLapse-L  BinMin  EqualSisters
  a.  [[[u]u]u]    ((↑uu) u)        ((↑uu) (u))                W       L
  b.  [[[u]a]u]    ((↑ua↓)(↑u))     ((↑ua↓) u)      W          L       W

But all is not well in EqualSisters-land: A different confrontation of
winner-loser pairs shows that the ranking contradiction between BinMin
and EqualSisters persists even with NoLapse-L. All that is needed is a
case where NoLapse-L does not differentiate the candidates. Let us take
the inputs [[[u]u]u] and [[[a]u]u], as in (14). In order for ((uu)u) to win over
((uu)(u)), we again need BinMin>>EqualSisters (14a). But in order for
(((a)(u))(u)) to win over the competing (((a)(u))u), which has one less unary
phrase, we need EqualSisters >> BinMin (14b). This time, NoLapse-L
fails to resolve the ranking contradiction between EqualSisters and
BinMin.





(14)  Input        Winner                Loser               NoLapse-L  BinMin  EqualSisters
  a.  [[[u]u]u]    ((↑u u) u)            ((↑u u)(↑u))                   W       L
  b.  [[[a]u]u]    (((↑a↓)(↑u))(↑u))     (((↑a↓)(↑u)) u)                L       W
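
Whether a set of winner-loser comparisons such as those in (12)-(14) admits any ranking at all can be checked automatically. The sketch below uses a bare-bones version of recursive constraint demotion, a standard OT ranking procedure that we choose here purely for illustration; the function name `rcd` and the encoding of rows as dictionaries of W/L marks are our assumptions.

```python
def rcd(constraints, rows):
    """Return a stratified ranking (list of lists), or None if the rows conflict."""
    remaining_rows = list(rows)
    remaining = list(constraints)
    strata = []
    while remaining:
        # A constraint can be ranked next if it prefers no loser in any open row.
        stratum = [c for c in remaining
                   if all(r.get(c) != "L" for r in remaining_rows)]
        if not stratum:
            return None                      # every remaining constraint prefers some loser
        strata.append(stratum)
        # Rows in which a newly ranked constraint prefers the winner are accounted for.
        remaining_rows = [r for r in remaining_rows
                          if not any(r.get(c) == "W" for c in stratum)]
        remaining = [c for c in remaining if c not in stratum]
    return strata


row_uuu = {"BinMin": "W", "EqualSisters": "L"}                      # (12a) = (13a) = (14a)
row_uau = {"BinMin": "L", "EqualSisters": "W"}                      # (12b)
row_uau_lapse = {"NoLapse-L": "W", "BinMin": "L", "EqualSisters": "W"}   # (13b)
row_auu = {"BinMin": "L", "EqualSisters": "W"}                      # (14b)

print(rcd(["BinMin", "EqualSisters"], [row_uuu, row_uau]))
print(rcd(["NoLapse-L", "BinMin", "EqualSisters"], [row_uuu, row_uau_lapse]))
print(rcd(["NoLapse-L", "BinMin", "EqualSisters"], [row_uuu, row_auu]))
```

In this encoding a blank cell is simply a missing key, so a constraint that does not distinguish winner and loser never blocks or rescues a row.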

As we consider the two pairs, a way out might suggest itself since the two
cases differ in terms of the severity of sister inequality. This becomes clearer
when we inspect more detailed representations, with explicit indications of
projection levels, as in (15).

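(15) Projection levels of φ (tree diagrams rendered as labeled bracketings; ω unlabeled)

a. winner: (φ1 (φ0 u u) u)                    loser: (φ1 (φ0 u u) (φ0 u))
b. winner: (φ2 (φ1 (φ0 a) (φ0 u)) (φ0 u))     loser: (φ2 (φ1 (φ0 a) (φ0 u)) u)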

Whereas the winner in (15a) has a pair of sister nodes (ω, φ0), the loser
in (15b) has a pair (ω, φ1). EqualSisters theorists might seize on this diffe-
rence and expand the constraint, in the familiar OT manner, into a family
of constraints penalizing sister inequality of different degrees of severity.
Let us assume, for concreteness, that besides the general EqualSisters con-
straint (5) penalizing any difference in category between sister nodes, there is
a more stringent constraint penalizing a situation where a category inequality
is aggravated by a concomitant projection level inequality. We might call
the more stringent constraint EqualSisters-​2, violated when λj is sister to
κi, with λ>κ and j>i. Ranked above BinMin, EqualSisters-​2 removes the
problem, as (16b) shows.
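
The difference between the two constraints can be illustrated with a short Python sketch. It rests on assumptions of ours that go beyond the text: ω is given projection level 0, the level of a φ is taken to be one more than that of its highest φ daughter (0 for a minimal φ, as in (15)), and violations are counted once per pair of sisters. All names are hypothetical.

```python
def parse(s):
    """Turn '(((a)(u))u)' into nested lists of 'a'/'u' words."""
    stack = [[]]
    for ch in s:
        if ch == "(":
            stack.append([])
        elif ch == ")":
            node = stack.pop()
            stack[-1].append(node)
        elif ch in "au":
            stack[-1].append(ch)
    return stack[0][0]


RANK = {"ω": 0, "φ": 1}      # position in the prosodic hierarchy


def label(node):
    """Return (category, projection level) of a node."""
    if isinstance(node, str):
        return ("ω", 0)
    phi_levels = [label(c)[1] for c in node if not isinstance(c, str)]
    return ("φ", (1 + max(phi_levels)) if phi_levels else 0)


def equal_sisters(node):
    """Return (EqualSisters, EqualSisters-2) violation counts for a parse tree."""
    if isinstance(node, str):
        return (0, 0)
    es = es2 = 0
    labels = [label(c) for c in node]
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            x, y = labels[i], labels[j]
            if x[0] != y[0]:                       # sisters of different categories
                es += 1
                high, low = (x, y) if RANK[x[0]] > RANK[y[0]] else (y, x)
                if high[1] > low[1]:               # aggravated by a level difference
                    es2 += 1
    for c in node:
        sub = equal_sisters(c)
        es, es2 = es + sub[0], es2 + sub[1]
    return (es, es2)


for candidate in ["((uu)u)", "((uu)(u))", "(((a)(u))(u))", "(((a)(u))u)"]:
    print(candidate, equal_sisters(parse(candidate)))
```

On these assumptions ((uu)u) violates only the general constraint, whereas (((a)(u))u) violates both, which is the asymmetry that (15) is meant to bring out.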






(16)
a. [[[u]u]u]   ((uu) u)         ((uu) (u))       W L
b. [[[a]u]u]   (((a)(u)) (u))   (((a)(u)) u)     W L W
c. [[[u]a]u]   ((ua) (u))       ((ua) u)         W L W
d. [[[u]a]a]   ((ua) (a))       ((ua) a)         W L W
e. [[[u]u]a]   ((uu) (a))       ((uu) a)         W L W
f. [[[a]a]a]   (((a)(a)) (a))   (((a)(a)) a)     W W L L
g. [[[a]u]a]   (((a)(u)) (a))   (((a)(u)) a)     W W L L
h. [[[a]a]u]   (((a)(a)) (u))   (((a)(a)) u)     W W L L

Given the two versions of EqualSisters, all eight accented-unaccented
combinations of three-member syntactic constituents (11) emerge in (16) with
their correct prosodic structure. We leave it for future research in Japanese and
other languages to determine whether admitting different/​stringent versions
of EqualSisters is a fruitful avenue to explore, and if so, what other types of
EqualSisters might play a role in a grammar.

9.4 Prosodic binarity constraints

Besides the minimal version of binarity (17a) (repeated from (2) above),
the maximal version (17b) also plays a crucial role in larger syntactic
structures.

(17) Binarity constraints

a. MinimalBinarity-φ/ω: φ is minimally binary. Assign one violation for
   each φ that does not dominate at least two ω.
b. MaximalBinarity-φ/ω: φ is maximally binary. Assign one violation for
   each φ that dominates more than two ω.

For four-member left-branching structures of the form [[[[1]2]3]4],
Kubozono (1989, 1993) (see also Shinya, Selkirk, and Kawahara 2004) has
convincingly shown that the prosodic structure is always such that the imme-
diate daughters of the maximal φ contain two ω’s, as exemplified with all
accented words in (18a) (from Ishihara 2014) and all unaccented words
in (18b).

(18) Left-branching structure [[[[1]2]3]4]

a. S: [[[[a]a]a]a]         [[[[Máriko-ga] nónda] wáin-no] niói]
                           Mariko-nom drank wine-gen smell
                           ‘the smell of wine that Mariko drank’
   P: (((a)(a))((a)(a)))   (((↑Má↓riko-ga)(↑nó↓nda)) ((↑wá↓in-no)(↑nió↓i)))

b. S: [[[[u]u]u]u]         [[[[Mamoru-ga] yonda] gakuchoo-no] uwasa]
                           Mamoru-nom invited president-gen rumor
                           ‘the rumor of the college president that Mamoru invited’
   P: ((uu)(uu))           ((↑Mamoru-ga yonda) (↑gakuchoo-no uwasa))

The evidence for these strictly binary prosodic parses (due to Kubozono
1989, 1993) consists of the initial rises (marked by up-arrows) in every φ and, in
the case of accented sequences, the extra rhythmic boost before the third
ω (indicated by the larger up-arrow). As shown earlier in (3), because of
undominated NoLapse-L and AccentAsHead, there are only four licit 2ω-structures:
(uu), (ua), ((a)(a)), and ((a)(u)), and joined into 4ω-structures they
yield the 4×4=16 combinatorial possibilities depicted in (19).

(19) ( { (uu), (ua), ((a)(a)), ((a)(u)) }  { (uu), (ua), ((a)(a)), ((a)(u)) } )

Why ((12)(34)), rather than the more closely matching (((12)3)4)? Ishihara
(2014), following up on an informal suggestion in Selkirk (2011), gives an
explicit OT analysis summarized in (20).



(20)
S:              P:                           BinMax-φ  BinMin-φ  Match-XP
[[[[u]u]u]u]    a. ► ((u u) (u u))           *                   **
                b.   (((u u) u) u)           **W                 *L
                c.   ((((u) u) u) u)         **W       *W        L
                d.   (u u u u)               *                   ***W
[[[[a]a]a]a]    e. ► (((a)(a)) ((a)(a)))     *         ****      *
                f.   ((((a)(a)) (a)) (a))    **W       ****      L
                g.   ((a) (a) (a) (a))       *         ****      **W

The strictly binary candidate ((uu)(uu)) wins because it does perfectly
on BinMin-φ and has only one violation on BinMax-φ, at the cost of
two unmatched XPs. The same holds for winning (((a)(a))((a)(a))), with the
difference that each a also constitutes its own φ because its accent needs
to be the head of a phrase. The candidates considered here are all parsed
into a single φ at the top level, violating BinMax-φ at least once. This is
the more important point made by Ishihara: There is a need for a separate,
and higher-ranking, match constraint requiring XPmax to correspond to a
φ, ruling out the prosodically non-cohering (uu)(uu). We will return to the
details in the next section, and here focus on the workings of the binarity
constraints.
Low-ranking Match-XP decides in favor of ((uu)(uu)) against the completely
flat (uuuu) (20d). The same holds for the 3ω-structure (21), where ((uu)u)
beats (uuu), even though here the empirical consequences are the same,
with no medial rise predicted (there are no known cues for the end of a φ in
Japanese), and Ishihara (2014) considers (uuu) to be the winner.




(21)
S:           P:                 BinMax-φ  BinMin-φ  Match-XP
[[[u]u]u]    a. ► ((u u) u)     *                   *
             b.   (u u u)       *                   **W
             c.   ((u u)(u))    *         *W        *

Further inspection reveals, however, that we cannot rely on low-ranking
Match-XP to determine the winner once more constraints are in place. It
turns out that we need an additional binarity constraint not counting ω’s, but
insisting on maximal binary branching. BinMaxBranch-φ (22) rules out,
for example, the flat (uuu) and (uuuu), which do not just contain more than 2
ω’s but also have ternary and quaternary branching.

(22) BinMaximalBranch-φ: φ obeys maximal binary branching: *(φ α β γ …).
     Assign one violation for each φ that branches into more than two
     daughter nodes.

The following tableau anticipates this point and shows that BinMax-​φ/​ω
does not distinguish between (23a) and (23d), but BinMaxBranch-​φ does.
On the other hand, BinMaxBranch-​φ does not distinguish between (23a-​c).

(23)
S:              P:                       BinMaxBranch-φ  BinMax-φ  BinMin-φ  Match-XP
[[[[u]u]u]u]    a. ► ((u u) (u u))                       *                   **
                b.   (((u u) u) u)                       **W                 *L
                c.   ((((u) u) u) u)                     **W       *W        L
                d.   (u u u u)           *W              *                   ***W

We will see below that while BinMax-φ is crucially dominated by other
constraints, BinMaxBranch-φ is undominated and never violated in
winning candidates. Pending further investigation into other systems, it
is perhaps worth keeping in mind whether the undominated nature of this
constraint may be an indication that Gen only produces unary and binary
prosodic constituents.
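
The three binarity constraints count different things, which a small Python sketch can make explicit. The function names and the bracket-string encoding are our own illustrative assumptions; the violation definitions follow (17) and (22).

```python
def parse(s):
    """Turn '((uu)(uu))' into nested lists of 'a'/'u' words."""
    stack = [[]]
    for ch in s:
        if ch == "(":
            stack.append([])
        elif ch == ")":
            node = stack.pop()
            stack[-1].append(node)
        elif ch in "au":
            stack[-1].append(ch)
    return stack[0][0]


def phrases(node):
    """Yield (number of daughters, number of words dominated) for every φ."""
    def count(n):
        return 1 if isinstance(n, str) else sum(count(c) for c in n)

    if not isinstance(node, str):
        yield (len(node), count(node))
        for child in node:
            yield from phrases(child)


def binarity(candidate):
    """Violation counts for BinMin-φ/ω, BinMax-φ/ω and BinMaxBranch-φ."""
    ps = list(phrases(parse(candidate)))
    return {
        "BinMin": sum(1 for d, w in ps if w < 2),        # fewer than two ω
        "BinMax": sum(1 for d, w in ps if w > 2),        # more than two ω
        "BinMaxBranch": sum(1 for d, w in ps if d > 2),  # more than two daughters
    }


for candidate in ["((uu)(uu))", "(((uu)u)u)", "((((u)u)u)u)", "(uuuu)"]:
    print(candidate, binarity(candidate))
```

In this sketch ((uu)(uu)) and (uuuu) tie on BinMax and differ only on BinMaxBranch, while the first three candidates all satisfy BinMaxBranch.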
Questions about further types of binarity constraints remain to be
explored, such as binarity requirements on φ[+max], φ[-​max], or φ[-​min], and can only
be properly addressed with additional empirical evidence. Thus Selkirk and
Elordieta (2010), based on earlier work by Elordieta (2007), have proposed a
constraint requiring exact binarity (at the φ-​level) for IP-​initial φ in Basque
(BinMinMax-​φ/​φ, in our terms), a type of StrongStart constraint (Selkirk
2011), and we expect additional findings along these lines.

9.5 Syntax-​prosody matching constraints

Just as there are several versions of binarity constraints, previous research has
uncovered that syntax-​prosody mapping in Japanese calls for higher-​ranking
specific versions of Match constraints requiring maximal XP’s (24b) and
non-​minimal XP’s (24c) to correspond to phonological phrases.

(24) Match-​XP constraints

a. Match-​XP-​to-​φ (repeated from (2) above): A phrase XP in
syntactic constituent structure is matched by a corresponding
phonological phrase φ in phonological representation.
b. Match-​XP[+max]-​to-​φ: A maximal phrase XP in syntactic constituent
structure is matched by a corresponding phonological phrase φ in
phonological representation (Ishihara 2014).8

c. Match-XP[-min]-to-φ: A non-minimal phrase XP in syntactic
constituent structure is matched by a corresponding phonological
phrase φ in phonological representation (Ito and Mester 2013).

As Ishihara (2014) correctly points out, without a higher-ranking
Match-XP[+max], candidates (25be), without a corresponding φ for their maximal XPs,
wrongly emerge as winners, because they do not violate BinMax-φ, and the
general Match-XP is ranked too low to rule them out.





(25)
S:              P:
[[[[u]u]u]u]    a. ► ((uu)(uu))            * * **
                b.   (uu) (uu)             *W L * ***W
                c.   (((uu)u)u)            **W L *L
[[[[a]a]a]a]    d. ► (((a)(a))((a)(a)))    * * **** **
                e.   ((a)(a)) ((a)(a))     *W L * **** ***W
                f.   ((((a)(a))(a))(a))    **W L **** *L

[Match-XP[+max] >> BinMax-φ] is determined by the WL pairing in candidate
rows (25be). On the other hand, the non-minimal version of
Match-XP[-min] (24c) proposed in Ito and Mester (2013) is violated by the winning
candidates (25a and d) (where the non-minimal XP [[[ω]ω]ω] does not have a
corresponding φ), and must be ranked below BinMax-φ.
Instead of reviewing here the empirical rationale for this XP[-​min] con-
straint given in Ito and Mester (2013) involving 3ω-​sequences, we take a
step back and explore an alternative way of thinking about this type of
mismatch violation. In parallel OT, there is no need for syntax-​prosody
mapping to only go in one direction, from syntax to prosody, and it is
also possible to check whether the surface prosody matches the syntax,
as in fact suggested in Selkirk (2011). In ((uu)1 (uu)2), (uu)1 is matched by
[[u]u] in the syntax, but (uu)2 has no syntactic correspondent. Rather than
proliferating Match-XP constraints, we propose here that the constraint
doing the crucial work is the prosody-syntax matching constraint
Match-φ defined in (26).

(26) Match-φ: A phonological phrase φ in phonological representation
is matched by a corresponding syntactic constituent in syntactic
constituent structure.

As formulated, (26) only looks for a corresponding syntactic constituent,
not necessarily an XP. Match-φ can simply take the place of
Match-​XP[-​min] in (25) above without any violation difference. We leave it
as a question for future research whether Match-​XP[-​min] is still necessary,
and similar questions can be asked about Match-​φ[+min], Match-​φ[+max],
Match-​φ[-​min], and Match-​φ[-​max].
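
The prosody-to-syntax direction of matching can be sketched in the same way as the syntax-to-prosody direction. The Python fragment below is an illustration under our own assumptions rather than a definitive statement of (26): function names are hypothetical, and since (26) asks only for some syntactic constituent, we let single words (syntactic terminals) count as constituents alongside the bracketed phrases.

```python
def parse(s):
    """Turn '[[[a]u]u]' or '((a)(uu))' into nested lists of 'a'/'u' words."""
    stack = [[]]
    for ch in s:
        if ch in "([":
            stack.append([])
        elif ch in ")]":
            node = stack.pop()
            stack[-1].append(node)
        elif ch in "au":
            stack[-1].append(ch)
    return stack[0][0]


def spans(node, start=0):
    """Return (end, spans): the (start, end) word span of every bracketed node."""
    pos, found = start, []
    for child in node:
        if isinstance(child, str):
            pos += 1
        else:
            pos, inner = spans(child, pos)
            found.extend(inner)
    found.append((start, pos))
    return pos, found


def match_phi_violations(syntax, prosody):
    """Count phrases whose words do not form a syntactic constituent."""
    end, phrasal = spans(parse(syntax))
    # Assumption: every single word is itself a syntactic constituent.
    constituents = set(phrasal) | {(i, i + 1) for i in range(end)}
    return sum(1 for phi in spans(parse(prosody))[1] if phi not in constituents)


print(match_phi_violations("[[[[u]u]u]u]", "((uu)(uu))"))   # the second (uu) is unmatched
print(match_phi_violations("[[[a]u]u]", "(((a)(u))(u))"))   # all phrases are matched
print(match_phi_violations("[[[a]u]u]", "((a)(uu))"))       # rebracketed (uu) is unmatched
```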
Why is Match-φ (replacing Match-XP[-min]) a necessary ingredient of the
analysis? For all-accented and all-unaccented 3ω-cases, Match-φ is not a crucial
factor (illustrated here by the all-W markings in the rows of (27ab)). On the
other hand, for cases of mixed accented/unaccented 3ω-cases, the rebracketed
candidates need to be ruled out by Match-φ>>BinMin-φ in (27cd). For full
tableaux with relevant candidates and detailed discussion, see Ito and Mester
2013, substituting Match-φ for Match-XP[-min].

(27)
   Input        Winner            Loser             Match-XP[+max]  BinMax-φ  Match-φ  BinMin-φ  Match-XP
a. [[[u]u]u]    ((uu) u)          (u (uu))                                    W                  W
b. [[[a]a]a]    (((a)(a)) (a))    ((a) ((a)(a)))                              W                  W
c. [[[a]u]a]    (((a)(u)) (a))    ((a) (ua))                                  W        L         W
d. [[[a]u]u]    (((a)(u)) (u))    ((a) (uu))                                  W        L         W

Moving on to the various 4ω-combinations in (28), rebracketed candidates
are actually winners, and are selected by [BinMax-φ >> Match-φ].

(28)
   Input           Winner                  Loser                   Match-XP[+max]  BinMax-φ  Match-φ  BinMin-φ  Match-XP
a. [[[[u]u]u]u]    ((uu) (uu))             (((uu) u) u)                            W         L                  L
b. [[[[a]a]a]a]    (((a)(a)) ((a)(a)))     ((((a)(a)) (a)) (a))                    W         L                  L
c. [[[[a]u]u]u]    (((a)(u)) (uu))         ((((a)(u)) u) u)                        W         L                  L
d. [[[[a]u]a]u]    (((a)(u)) ((a)(u)))     ((((a)(u)) (a)) (u))                    W         L                  L

The overall interplay of constraints from the Match and Binarity families
has the interleaved structure in (29).

(29) Match             Bin         Match      Bin         Match
     Match-XP[+max] >> BinMax-φ >> Match-φ >> BinMin-φ >> Match-XP

9.6 Summary and conclusion

Throughout our research reported in this chapter, we have relied on
OTW(orkplace)9 to guide us in precisely formulating the constraints, and veri-
fying proper (non-​contradictory) rankings. With all ten constraints discussed
in this chapter (ranging from syntax-​prosody match constraints, to structural
constraints requiring binarity/​equal sisters, to accentual/​tonal constraints),
OTW allows us to check our proposed constraints and rankings, and in the
process, to also pinpoint problem areas and contradictory rankings, as well
as find unintended consequences, unwanted winners, and so on. In (30) and
(31), we present the OTW tableaux for two examples with crucial candidates
showing the established rankings.










(30)
S:           P:
[[[a]u]u]    a. ► (((a)(u))(u))    1 3
             b.   (((a)(u))u)      1 1W 2L 1W
             c.   ((a)(uu))        1 1W 1L 1W
             d.   ((a(u))(u))      1W 1 2L 1W 1W
             e.   ((au)u)          1W 1 L 1W 1W

In (30), the leftmost sequence of WL in row (b) shows [EqualSis >> BinMin-φ],
in row (c) [Match-φ >> BinMin-φ], in row (d) [AccAsHead >> BinMin-φ],
and in row (e) [NoLapse-L >> BinMin-φ].










(31)
S:              P:
[[[[u]u]u]u]    a. ► ((uu)(uu))      1 1 2
                b.   (((u)u)(uu))    1 1 1W 1L 1W
                c.   (((uu)u)u)      2W L 1W 1L 2W
                d.   (uuuu)          1W 1 L 3W
                e.   (uu)(uu)        1W L 1 3W

In (31), the leftmost WL sequence in row (b) shows [BinMin-φ >> Match-XP],
in row (c) [BinMax-φ >> Match-φ], in row (d) [BinMaxBranch-φ >> Match-φ],
and in row (e) [Match-XP[+max] >> BinMax-φ].
The overall constraint ranking, as diagrammed by OTW, is shown
in (32), where we see five undominated constraints, alternating types of
Match and Binarity constraints as the main axis, and remaining inter-
leaving constraints.

(32) Japanese constraint ranking:






An additional bonus of OTW is that it not only produces the OT grammar
under investigation but also the full factorial typology of the underlying con-
straint set. Within this typology, we indeed find the grammar generating the
Basque system (Section 9.2.2), as shown in (33). It is instructive to compare it
to the grammar of Japanese in (32) above.
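
The idea of a factorial typology can be illustrated on a much smaller scale than the full ten-constraint system. The Python sketch below is our own toy example and does not reproduce OTWorkplace: it uses only the four constraints of Section 9.2.1, the two-word inputs, and violation counts read off tableau (3) in the column order AccAsHead, NoLapse-L, BinMin, Match-XP (our reading of that tableau). All names are hypothetical.

```python
from itertools import permutations

NAMES = ("AccAsHead", "NoLapse-L", "BinMin", "Match-XP")

# Violation profiles per input and candidate, in the order given by NAMES.
TABLEAU = {
    "[[u]u]": {"(uu)": (0, 0, 0, 1), "(u(u))": (0, 0, 1, 1),
               "((u)u)": (0, 0, 1, 0), "((u)(u))": (0, 0, 2, 0)},
    "[[u]a]": {"(ua)": (0, 0, 0, 1), "(u(a))": (0, 0, 1, 1),
               "((u)a)": (1, 0, 1, 0), "((u)(a))": (0, 0, 2, 0)},
    "[[a]a]": {"((a)(a))": (0, 0, 2, 0), "(aa)": (1, 0, 0, 1),
               "(a(a))": (1, 0, 1, 1), "((a)a)": (1, 0, 1, 0)},
    "[[a]u]": {"((a)(u))": (0, 0, 2, 0), "(au)": (0, 1, 0, 1),
               "(a(u))": (1, 0, 1, 1), "((a)u)": (0, 1, 1, 0)},
}


def language(ranking):
    """Return the tuple of winners (one per input) under a total ranking."""
    order = [NAMES.index(c) for c in ranking]
    return tuple(
        min(cands, key=lambda c: tuple(cands[c][i] for i in order))
        for cands in TABLEAU.values()
    )


# Group all 24 total rankings by the set of winners they select.
typology = {}
for ranking in permutations(NAMES):
    typology.setdefault(language(ranking), []).append(ranking)

JAPANESE = ("(uu)", "(ua)", "((a)(a))", "((a)(u))")
BASQUE = ("(uu)", "(ua)", "((a)(a))", "(au)")

for winners, rankings in typology.items():
    print(winners, "from", len(rankings), "ranking(s)")
```

Both the Japanese-type and the Basque-type sets of winners appear among the predicted systems of this toy typology, differing in the relative ranking of NoLapse-L and BinMin.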

(33) Basque constraint ranking:






The only crucial difference between the two systems is the ranking of
NoLapse-​L and BinMin-​φ:

(34) Japanese: [NoLapse-L >> BinMin-φ]
     Basque:   [BinMin-φ >> NoLapse-L]

Less important is a difference in the ranking of EqualSisters-2. In the
Basque system, it is not ranked with respect to BinMin-φ because the input [[[a]u]u]
comes out as ((au)u), that is, with the last u unphrased. This candidate fulfills
both EqualSisters-2 and BinMin-φ, so no ranking of the two is required.
In the Japanese system, the outcome is (((a)(u))(u)), fulfilling EqualSisters-2
but violating BinMin-φ. This means [EqualSisters-2 >> BinMin-φ].
The core data and their OT analysis, as summarized by OTW in its skeletal
basis, appear in a comparative tableau format in (35).

(35) Skeletal basis for Japanese syntax-prosody mapping

Input Winner Loser

a. [[[[u]‌u]u]u] ((uu)(uu)) (uu)(uu) W L L L
b. [[a]‌u] ((a)(u)) (au) W    L W
c. [[a]‌u] ((a)(u)) (a(u)) W L W W
d. [[[[u]‌u]u]u] ((uu)(uu)) (uuuu) W L W
e. [[[a]‌u]u] (((a)(u))(u)) (((a)(u))u) W L W
f. [[[[u]‌u]u]u] ((uu)(uu)) (((uu)u)(u)) W L W L W
g. [[[u]‌u]a] ((uu)(a)) (u(ua)) W L W W
h. [[u]‌u] (uu) ((u)u) W L W
i. [[[u]‌u]u] ((uu)u) ((uu)(u)) W L

(36) Skeletal basis for Basque syntax-prosody mapping

Input Winner Loser

a. [[[[u]‌u]u]u] ((uu)(uu)) (uu)(uu) W L L L
b. [[a]‌a] ((a)(a)) (aa) W L W
c. [[[[u]‌u]u]u] ((uu)(uu)) (uuuu) W L W
d. [[[[u]‌u]u]u] ((uu)(uu)) (((uu)u)(u)) W L W L W
e. [[[u]‌u]a] ((uu)(a)) (u(ua)) W L W W
f. [[a]‌u] (au) ((a)(u)) W L L
g. [[[u]‌u]u] ((uu)u) ((uu)(u)) W L

In conclusion, it is remarkable that Match Theory (Selkirk 2011) has not
only resulted in a more principled and more streamlined understanding of
syntax-​prosody mapping relations, but has also opened up new perspectives on
many prosodic well-formedness constraints, as they impinge on the otherwise
expected isomorphism between syntactic structure and prosodic form. Using
well-​known generalizations about the way phrasal structures in Japanese
are parsed into prosodic units as a testing ground, this chapter has reported
on some preliminary results regarding the role of anti-​lapse constraints in
pitch accent systems (Section 9.2), the finer details of hierarchical equality
requirements on sister nodes (Myrberg 2013, Section 9.3), the diversity of
binarity constraints (Section 9.4), and the necessity of distinguishing sep-
arate instantiations of syntax-​prosody mapping constraints for higher-​level
syntactic constituents (Ishihara 2014; Section 9.5). Finally, we have ended up
with an initial analysis of the Japanese data under consideration (Section 9.6)
that gives some idea of the complexity of the system underlying these facts,
and of the variety of constraints needed to capture them.

1 Part of this research was presented at the 1st International Conference on Prosodic
Studies (ICPS-​1): Challenges and Prospects, June 2015, Tianjin, China, where we
benefited from fruitful discussions with many conference participants, in particular,
Carlos Gussenhoven, Ellen Kaisse, Chi-​Lin Shih, Irene Vogel, and Hongming
Zhang. We are grateful to Shin Ishihara, Sara Myrberg, and Alan Prince for pro-
ductive discussions of many of the issues dealt with in this chapter. Special thanks to
the 2015 syntax-​prosody proseminar participants at UC Santa Cruz, where the core
of the analysis was developed in discussions with Jeff Adler, Jenny Bellik, Steven
Foley, Nick Kalivoda, and Jason Ostrove as well as Jim McCloskey and Maziar
Toosarvandani. We are responsible for all remaining errors and shortcomings.
2 NoLapse fits into a larger family of tonal anti-​lapse constraints that are at work in
pitch accent languages, such as Ancient Greek (see Ito and Mester 2013: 31).
3 I.e., a φ not dominating another φ, as defined in Ito and Mester 2007, 2013.
4 The tableaux below are violation tableaux with added comparative markings
(Prince 2000), with W’s and L’s appearing in the rows of losing candidates. “W” in
a constraint column indicates the winner is favored by the constraint, “L” indicates
the loser is favored, and no entry indicates a tie (i.e., the violation marks for the
winner equal those of the loser). In order for a ranking tableau to be consistent,
each L has to be preceded by a W in its row (in order to win, the winner needs to do
better than each loser on the highest-​ranked constraint that distinguishes the two,
in Jane Grimshaw’s succinct phrasing).
5 Formerly called *Adjunction, under which name it appears in Selkirk and
Elordieta 2010.
6 This tableau and the following are our interpretation (Selkirk and Elordieta 2010
provide no tableaux).
7 The winning candidate here is (au), which we take to be the correct outcome since we
do not know of any evidence showing that the leading a is a separate phrase, as in
Selkirk and Elordieta 2010. If there is indeed reason to assign the parse ((a)u), it will be
the predicted winner once Align-​R is added to the system and ranked above MinBin.
8 The definition here is slightly revised from that in Ishihara 2014, which requires a
matching φ[+max]. We require only a matching φ in order to allow for prosodic cliti-
cization to XP[+max], for example.
9 OTWorkplace_X_83, version of 27 June 2015. The program developed by
Alan Prince, Bruce Tesar, and Naz Merchant is open-source and distributed
without charge, downloadable from https://​​ site/​otworkplace/​.
OTWorkplace is a software suite that, in the words of its authors, “uses Excel as a
platform for interactive research with the analytical tools of modern rigorous OT”.

Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York:
Harper & Row.
Elfner, E. (2012) Syntax-​prosody interactions in Irish. Ph.D. Diss. University of
Massachusetts, Amherst.
Elordieta, G. (2007) “Minimum size constraints on intermediate phrases”. Retrieved
15 August 2015,​conference/​Papers/​1682/​1682.pdf.
Féry, C. (2010) “Recursion in prosodic structure”, Phonological Studies, 13, pp. 51–​60.
Gussenhoven, C. (2004) The phonology of tone and intonation. Cambridge: Cambridge
University Press.
Gussenhoven, C. (2005) “Procliticized phonological phrases in English: Evidence from
rhythm”, Studia Linguistica, 59(2/​3), pp. 174–​193.
Hulst, H. van der. (2010) “A note on recursion in phonology” in Hulst, H. van der (ed.)
Recursion and human language. Berlin: Mouton de Gruyter, pp. 301–​342.
Ishihara, S. (2014) “Match Theory and the recursivity problem” in Kawahara, S., and
Igarashi, M. (eds.) Proceedings of FAJL 7: Formal Approaches to Japanese Linguistics.
MIT Working Papers in Linguistics 73. Cambridge, MA: MIT Press, pp. 69–​88.

Ito, J., and Mester, A. (1992) “Weak layering and word binarity”. Linguistic Research
Center Working Paper LRC-​92-​09. UC Santa Cruz. Published in Honma, T.,
Okazaki, M., Tabata, T., and Tanaka, S. (eds.) (2003) A new century of phon-
ology and phonological theory: A festschrift for Professor Shosuke Haraguchi on the
occasion of his sixtieth birthday. Tokyo: Kaitakusha, pp. 26–​65.
Ito, J., and Mester, A. (2007) “Prosodic adjunction in Japanese compounds” in
Miyamoto, Y., and Ochi, M. (eds.) Formal approaches to Japanese linguis-
tics: Proceedings of FAJL 4. Cambridge, MA: MIT Department of Linguistics and
Philosophy, pp. 97–​111.
Ito, J., and Mester, A. (2009a) “The extended prosodic word” in Grijzenhout, J., and
Kabak, B. (eds.) Phonological domains: Universals and deviations. Berlin: Mouton
de Gruyter, pp. 135–​194.
Ito, J., and Mester, A. (2009b) “The onset of the prosodic word” in Parker, S. (ed.)
Phonological argumentation: Essays on evidence and motivation. London: Equinox,
pp. 227–​260.
Ito, J., and Mester, A. (2013) “Prosodic subcategories in Japanese”, Lingua, 124,
pp. 20–​40.
Kabak, B., and Revithiadou, A. (2009) “An interface approach to prosodic word recur-
sion” in Grijzenhout, J., and Kabak, B. (eds.) Phonological domains: Universals and
deviations. Berlin: Mouton de Gruyter, pp. 105–​134.
Kubozono, H. (1993) The organization of Japanese prosody. Ph.D. Diss., University
of Edinburgh. Tokyo: Kurosio Publishers.
Kubozono, H. (1989) “Syntactic and rhythmic effects on downstep in Japanese”,
Phonology, 6(1), pp. 39–​67.
Ladd, D. R. (1986) “Intonational phrasing: The case for recursive prosodic structure”,
Phonology Yearbook, 3, pp. 311–​340.
Ladd, D. R. (1988) “Declination ‘reset’ and the hierarchical organization of
utterances”, Journal of the Acoustical Society of America, 84, pp. 530–​544.
McCawley, J. D. (1968) The phonological component of a grammar of Japanese. The
Hague, The Netherlands: Mouton.
Myrberg, S. (2013) “Sisterhood in prosodic branching”, Phonology, 30(1), pp. 73–​124.
Nespor, M., and Vogel, I. (1986) Prosodic phonology. Dordrecht: Foris.
Pierrehumbert, J., and Beckman, M. (1988) Japanese tone structure. LI Monograph
Series No. 15. Cambridge, MA: MIT Press.
Prince, A. S. (2000) Comparative tableaux. Rutgers Optimality Archive. ROA-​376.
Retrieved 15 August 2015 from http://​​.
Prince, A. S., and Smolensky, P. (2004) Optimality theory: Constraint interaction in gen-
erative grammar. RuCCS-​TR-​2. Rutgers University, Brunswick, NJ, and University
of Colorado, Boulder. Malden, MA: Blackwell.
Schreuder, M. (2006) Prosodic processes in language and music. Ph.D. Diss.,
Rijksuniversiteit Groningen, The Netherlands.
Schreuder, M., and Gilbers, D. (2004) “Recursive patterns in phonological phrases”
in Bel, B., and Marlien, I. (eds.) Proceedings of Speech Prosody 2004. Nara,
Japan: ISCA (SproSig), pp. 341–​344.
Selkirk, E. (1984) Phonology and syntax: The relation between sound and structure.
Cambridge, MA: MIT Press.
Selkirk, E. (1986) “On derived domains in sentence phonology”, Phonology, 3,

Selkirk, E. (1996) “The prosodic structure of function words” in Morgan, J. L., and
Demuth, K. (eds.) Signal to syntax. Mahwah, NJ: Lawrence Erlbaum, pp. 187–​213.
Selkirk, E. (2009) “On clause and intonational phrase in Japanese: The syntactic
grounding of prosodic constituent structure”, Gengo Kenkyuu, 136, 35–​73.
Selkirk, E. (2011) “The syntax-​phonology interface” in Goldsmith, J. A., Riggle, J., and
Yu, A. (eds.) The handbook of phonological theory. 2nd edition. Oxford: Blackwell,
pp. 435–​485.
Selkirk, E., and Elordieta, G. (2010) “The role for prosodic markedness constraints
in phonological phrase formation in two pitch accent languages”. Handout of
paper presented at Tone and Intonation in Europe (TIE) 4, Stockholm University,
Department of Scandinavian Languages. Retrieved 15 August 2015, www.hum.​pub/​jsp/​polopoly.jsp?d=13236&a=73666.
Shinya, T., Selkirk, E., and Kawahara, S. (2004) “Rhythmic boost and recursive minor
phrase in Japanese” in Bel, B., and Marlien, I. (eds.) Proceedings of Speech Prosody
2004. Nara, Japan: ISCA (SproSig) pp. 345–​348.
Vance, T. J. (2008) The sounds of Japanese. Cambridge: Cambridge University Press.
Wagner, M. (2005) Prosody and recursion. Ph.D. Diss., Massachusetts Institute of
Technology.
Wagner, M. (2010) “Prosody and recursion in coordinate structures and beyond”,
Natural Language and Linguistic Theory, 28, pp. 183–​237.

Prosodic studies of two Chinese dialects
Hongming Zhang

10.1 Introduction
The nature of the syntax-phonology interface is of crucial importance
to prosodic phonology, and it involves two fundamental problems: (i) how
accessible is syntactic information to phonological processes, and (ii) what
grammatical properties are relevant to phonology? The history of research on
the syntax-phonology interface can be divided into two phases: the phase
before Optimality Theory (hereafter OT) and the phase after OT. This chapter
discusses some interface issues through case studies of Xiamen and Pingyao,
two dialects of Chinese. It argues that OT, which has to resort to brute force
or ad hoc constraints, fails to capture the nature of tone sandhi (hereafter
TS) in both Xiamen and Pingyao, and that interface theory within the OT
framework has no greater explanatory power than the theories proposed before
the OT era. Section 10.2 presents the theoretical background, Sections 10.3
and 10.4 center on the case studies of Xiamen Chinese and Pingyao Chinese,
respectively, Section 10.5 summarizes the discussion, and Section 10.6 offers
the conclusion.

10.2 Prosodic studies in OT

Along with the birth of OT in the 1990s (McCarthy and Prince 1993; Prince and
Smolensky 1993), which abandons derivation-based phonology and embraces
a constraint-based approach, came a new perspective in prosodic studies.
Within the OT framework, prosodic phonology follows the basic principles of
the theory: constraints are universal and violable, while their ranking is
language-specific. The mapping between syntax and phonology is interpreted as
the interaction of ranked constraints that refer to different grammatical
categories. Different rankings of the constraints lead to different prosodic
units in natural languages. Prosodic studies within the OT framework are
therefore thought to be able to predict the various types of prosodic units
found in different languages.
The constraints in prosodic studies within the OT framework can be divided
into three types: constraints on the mapping between syntax and phonology,
prosodic constraints, and morphosyntactic constraints. The first type consists
of faithfulness constraints, while the other two are markedness constraints.
In the spirit of the Generalized Alignment Theory of McCarthy and Prince
(1993), Selkirk (1996) defines a class of constraints on edge alignment of
syntactic units with phonological units, as in (1) and (2).

(1) a. ALIGN (X⁰, L; ω, L)
    b. ALIGN (X⁰, R; ω, R)

(2) a. ALIGN (XP, L; φ, L)
    b. ALIGN (XP, R; φ, R)

The constraints in (1) require the left or right edge of each lexical word to
align with the left or right edge of a prosodic word. The constraints in
(2) state that the left or right edge of any XP in the morphosyntactic
structure coincides with the left or right edge of some phonological phrase in
the prosodic structure.
Align constraints thus require that an edge of a syntactic constituent match
an edge of a prosodic category. As a family, they cover different levels of
the mapping between syntax and phonology. The two constraints in (2) were
later referred to as “ALIGN-XP, L” and “ALIGN-XP, R”, or simply “ALIGN-XP”, in
Truckenbrodt (1995, 1999): the right edge of a syntactic phrase must align
with the right edge of a phonological phrase, and the left edge of a syntactic
phrase with the left edge of a phonological phrase. Alignment Theory is in
effect the Edge-based approach (Chen 1985, 1987; Selkirk 1986) restated in OT
terms.
Another theory related to the study of the interface is the Wrapping Theory
within the OT framework (Truckenbrodt 1999). It requires that each syntactic
phrase be wrapped within a phonological phrase: Wrap (XP; φ). Alignment
constraints and wrapping constraints are both typical constraints on the
interface between syntax and phonology. In addition, there is Match Theory
(Selkirk 2006, 2009, 2011), as given in (3).

(3) Match Theory of Syntactic-prosodic Constituency Correspondence:

    a. Match (α, π) [S-P faithfulness]
    b. Match (π, α) [P-S faithfulness]
    c. Match Clause
       A clause in a syntactic constituent structure must be matched by
       a corresponding prosodic constituent, call it ι, in phonological
       representation.
    d. Match Phrase
       A phrase in a syntactic constituent structure must be matched by
       a corresponding prosodic constituent, call it φ, in phonological
       representation.
    e. Match Word
       A word in a syntactic constituent structure must be matched by
       a corresponding prosodic constituent, call it ω, in phonological
       representation.

These Match constraints call for the constituent structures of syntax and
phonology to correspond, which predicts a strong tendency for phonological
domains to mirror syntactic constituents. On this view, the phonological
constituent structure produced for individual sentences in individual
languages is the result of Match constraints that respect syntactic
constituency. Moreover, in identifying distinct prosodic constituent types
(ι, φ, ω) to correspond to the designated syntactic constituent types, the
Match Theory embodies the claim that the grammar allows the fundamental
syntactic distinctions between clause, phrase, and word to be reflected in the
phonological representation.
In addition to the interface constraints discussed above, four general
constraints entailed by the Strict Layer Hypothesis have also been proposed
within OT (Selkirk 1995), as given in (4).

(4) Strict Layer Hypothesis in Optimality Theory:

    (where Cⁿ = some prosodic category)
    a. Layeredness: no Cⁱ dominates a Cʲ, where j > i (e.g., no σ dominates a Σ)
    b. Headedness: any Cⁱ must dominate a Cⁱ⁻¹ (e.g., a ω must dominate a Σ)
    c. Exhaustivity: no Cⁱ immediately dominates a Cʲ, where j < i−1
       (e.g., no ω immediately dominates a σ)
    d. Non-recursivity: no Cⁱ dominates a Cʲ, where j = i (e.g., no Σ
       dominates a Σ)

The Strict Layer Hypothesis is the well-formedness condition on the tree
diagram of the prosodic hierarchy. It determines the organization of the
prosodic hierarchical structure and constrains the prosodic constituents that
serve as domains for the application of phonological rules. The structural
relationship among different prosodic constituents has a direct bearing on the
construction principles of the prosodic hierarchy.
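The four constraints in (4) can be read as checks on a prosodic tree. The following is a minimal sketch, not an implementation taken from the literature: it assumes the scale σ < Σ < ω < φ < ι built from the category symbols used in (3) and (4), counts Layeredness, Exhaustivity, and Non-recursivity once per mother-daughter pair, and checks Headedness over all descendants. The names `Node`, `LEVELS`, `word`, and `violations` are illustrative.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed ordering of prosodic categories, lowest to highest.
LEVELS = {"σ": 0, "Σ": 1, "ω": 2, "φ": 3, "ι": 4}

@dataclass
class Node:
    cat: str
    children: List["Node"] = field(default_factory=list)

def dominates_level(node: Node, level: int) -> bool:
    """True if some descendant of node belongs to the given level."""
    return any(LEVELS[c.cat] == level or dominates_level(c, level)
               for c in node.children)

def violations(root: Node) -> dict:
    """Count violations of the constraints in (4) in the tree under root."""
    counts = {"Layeredness": 0, "Headedness": 0,
              "Exhaustivity": 0, "Non-recursivity": 0}

    def visit(n: Node) -> None:
        i = LEVELS[n.cat]
        # Headedness: any C^i above the syllable must dominate a C^(i-1).
        if i > 0 and not dominates_level(n, i - 1):
            counts["Headedness"] += 1
        for c in n.children:
            j = LEVELS[c.cat]
            if j > i:
                counts["Layeredness"] += 1      # daughter is higher than mother
            elif j == i:
                counts["Non-recursivity"] += 1  # daughter has the mother's category
            elif j < i - 1:
                counts["Exhaustivity"] += 1     # a level is skipped
            visit(c)

    visit(root)
    return counts

def word() -> Node:
    """A fully layered prosodic word: ω over Σ over σ."""
    return Node("ω", [Node("Σ", [Node("σ")])])

# A recursive phrase of the shape ((ω)φ ω)φ.
recursive_phrase = Node("φ", [Node("φ", [word()]), word()])
print(violations(recursive_phrase))
```

On this sketch a phrase of the shape ((ω)φ ω)φ incurs a single Non-recursivity mark, while a prosodic word that directly dominates a syllable incurs one Exhaustivity mark and one Headedness mark.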
Within the OT framework, there are many constraints on different types
of prosodic units. For instance, BinMin (φ, ω) requires that a phonological
phrase contain at least two prosodic words, while BinMax (φ, ω) demands
that a phonological phrase contain at most two prosodic words.
Further constraints of this kind are summed up in (5).

(5) a. WRAP-XP, which requires that each XP is contained in the same
       phonological phrase;
    b. NONREC, which demands that phrases are not recursive (Truckenbrodt
       1995, 1999);
    c. ALIGN-FOC, which refers to each focused constituent being aligned with a
       phonological phrase boundary (Truckenbrodt 1995, 1999);
    d. BINARY-MAP, which means that a major phrase must consist of at most
       and/or at least two minor phrases (Selkirk 2000; Prieto 2005, 2006);
    e. UNIFORMITY, which requires that a string is ideally parsed into units of the
       same length (Ghini 1993; Sandalo and Truckenbrodt 2002; Prieto 2005,
    f. INCREASING UNITS, which indicates that phonological phrases on the
       recursive side are heavier than those in the non-recursive side (Ghini

The Align-Wrap Theory and the Match Theory reflect the most recent
developments in the study of the interface between syntax and phonology, and
they make different predictions about prosodic structures. When Wrap-XP,
Align-XP, and Non-recursivity interact, they predict different types of
relations between syntax and phonology (Truckenbrodt 1999). In a language that
ranks Non-recursivity high, for a VP with two internal arguments like
[NP NP V]VP, Wrap-XP and Align-XP derive three different prosodic structures,
depending on their ranking, as seen below.

(6) a. (NP NP Verb)φ      Wrap-XP >> Align-R/L-XP
    b. (NP) (NP) Verb     Align-R-XP >> Wrap-XP
    c. (NP) (NP Verb)     Align-L-XP >> Wrap-XP

In (6a), Wrap-XP is ranked higher than Align-XP: the VP corresponds to a
phonological phrase, and there is no other phonological phrase within the VP.
If either Align-XP constraint, Align-XP-L or Align-XP-R, is ranked higher than
Wrap-XP, as in (6b) and (6c), the VP as a whole has no corresponding
phonological phrase. The Match Theory, by contrast, seems able to derive only
one type of prosodic structure: ((NP)φ (NP)φ Verb)φ. Further cross-linguistic
study is needed to determine which theory is actually supported by the domains
of application of phonological rules.
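The interaction of Wrap-XP and Align-XP on [NP NP V]VP can be sketched computationally. The code below is an illustration under several assumptions that are not part of the text: Non-recursivity and Exhaustivity are treated as undominated, so only flat parses in which every word belongs to some φ are generated (the verb of (6b) therefore surfaces in a φ of its own); XPs are given as word-index spans; and a low-ranked constraint penalizing every φ (`star_phi`) is added to break ties. All function names are hypothetical.

```python
from itertools import product

WORDS = ["NP", "NP", "V"]
# Each XP as (index of first word, index of last word); labels are illustrative.
XPS = {"NP1": (0, 0), "NP2": (1, 1), "VP": (0, 2)}

def candidates(n):
    """All exhaustive, non-recursive parses of n words into contiguous φ spans."""
    parses = []
    for cuts in product([False, True], repeat=n - 1):
        spans, start = [], 0
        for i, cut in enumerate(cuts):
            if cut:                       # a φ boundary after word i
                spans.append((start, i))
                start = i + 1
        spans.append((start, n - 1))
        parses.append(tuple(spans))
    return parses

def wrap_xp(parse):
    """One mark per XP that is not contained in a single φ."""
    return sum(not any(s <= xs and xe <= e for s, e in parse)
               for xs, xe in XPS.values())

def align_r(parse):
    """One mark per XP whose right edge is not the right edge of some φ."""
    ends = {e for _, e in parse}
    return sum(xe not in ends for _, xe in XPS.values())

def align_l(parse):
    """One mark per XP whose left edge is not the left edge of some φ."""
    starts = {s for s, _ in parse}
    return sum(xs not in starts for xs, _ in XPS.values())

def star_phi(parse):
    """One mark per φ (assumed tie-breaker, ranked lowest)."""
    return len(parse)

def winner(ranking):
    """Strict domination: compare violation profiles lexicographically."""
    return min(candidates(len(WORDS)),
               key=lambda parse: tuple(con(parse) for con in ranking))

def show(parse):
    return " ".join("(" + " ".join(WORDS[s:e + 1]) + ")" for s, e in parse)

for name, ranking in [
    ("Wrap-XP >> Align-R-XP", [wrap_xp, align_r, star_phi]),
    ("Align-R-XP >> Wrap-XP", [align_r, wrap_xp, star_phi]),
    ("Align-L-XP >> Wrap-XP", [align_l, wrap_xp, star_phi]),
]:
    print(name, "->", show(winner(ranking)))
```

Under these assumptions the three rankings select one φ for the whole VP, a φ after each NP, and a φ before each NP, respectively, which corresponds to the three-way typology in (6).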

10.3 Xiamen Chinese: Case study (1)

10.3.1 Xiamen tonal system

There are seven citation tones in Xiamen, a southern Min dialect of Chinese,
as shown in (7).1

(7) The Citation Tones in Xiamen:

a. 44 b. 53 c. 21 d. 22 e. 24 f. 5 g. 3

The TS rule and the mode of rule application in Xiamen at the phrasal
level are stated in (8) and (9), respectively.2

(8) Tone Sandhi Rule (TSR):

T → T’ /​_​_​_​T]α

(9) The Mode of TSR:

a. Free Syllable:
   [Figure: circle of chain-shifting free-syllable tones; only the values 44
   and 21 survive from the diagram]
b. Checked Syllable:3
   (i)  5 → 21 (-p, -t, -k)
            21 (-q)
   (ii) 3 → 5 (-p, -t, -k)
            53 (-q)

Generally speaking, there is a process whereby each citation tone assumes
a sandhi form in a sort of chain shift. The “free” syllable tones form a closed
circle, as depicted in (9a). “Checked” syllable tones form a subsystem of their
own, and the rules are given in (9b). If the phonetic details are disregarded,
both “free” and “checked” syllable TS can be generalized as (8).
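The schematic rule in (8) says that a tone takes its sandhi form whenever another tone follows it within the same domain α, so that only the domain-final tone keeps its citation form. A minimal sketch of this mode of application follows. The function name `apply_tsr` and the input format (a list of domains, each a list of citation tones written as strings) are assumptions made for illustration. When no mapping is supplied, the sandhi form is written schematically as T', as in (8).

```python
def apply_tsr(domains, sandhi_map=None):
    """Apply T -> T' / ___ T]α to each domain and return the surface tones.

    domains:    list of domains, each a list of citation tones (strings)
    sandhi_map: optional dict from citation tone to sandhi tone; if omitted,
                the sandhi form is marked schematically with a prime
    """
    surface = []
    for domain in domains:
        for position, tone in enumerate(domain):
            if position == len(domain) - 1:
                surface.append(tone)              # domain-final: citation tone kept
            elif sandhi_map is None:
                surface.append(tone + "'")        # schematic sandhi form T'
            else:
                surface.append(sandhi_map[tone])  # concrete sandhi value
    return surface

# Two domains: the first two syllables form one domain, the third its own.
print(apply_tsr([["44", "53"], ["21"]]))

# Checked syllables ending in -p, -t, -k, with the values given in (9b).
print(apply_tsr([["3", "5"]], {"5": "21", "3": "5"}))
```

The first call marks only the non-final tone of the first domain as changed; the second replaces the non-final checked tone 3 by 5 and leaves the final 5 untouched.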

10.3.2 OT analysis of Xiamen TS

Xiamen Chinese has rich and complicated TS phenomena, which have
been thoroughly studied (Cheng 1968, 1973, 1991; Chen 1985, 1987, 1992,
2000; Chung 1989; Hsiao 1991; Hsu 1992; Zhang 1992; Lin 1994).
Truckenbrodt (1999) analyzes the domain of Xiamen TS within the OT
framework. According to him, the ranking of prosodic constraints for the
phonological phrase in Xiamen is Wrap-XP >> Align-R-XP. The problems of
Xiamen TS are therefore examined first within OT and then with some
non-OT approaches.
First, we need to check whether the Wrapping Theory and the Alignment
Theory can explain the difference between AvP as the adjunct of the sentence
and AvP as the adjunct of the VP.4 The case of AvP as the adjunct of the VP
is given below.5

(10) Syntactic structure: [[yi]DP [[yi-king]AvP = tsau]VP = a]IP
     Prosodic structure:  (yi   yi-king   tsau)φ
                           he   already   go      ASP
                           ‘He has already left.’

     [[yi]DP [[yi-king]AvP tsau]VP a]IP   Exhaustivity   Wrap-XP   Align-R   *P-phrase
     a. (yi yi-king tsau)φ                                         *         *
     b. (yi yi-king)φ (tsau)φ                            *!                  **
     c. (yi)φ (yi-king tsau)φ                                      *         **!
     d. (yi)φ (yi-king)φ (tsau)φ                         *!                  ***
     e. yi (yi-king tsau)φ                *!                       *         *

In tableau (10), the AvP yi-king in candidates (b) and (d) has a
corresponding phonological phrase boundary on its right. Moreover, yi is a
DP, which belongs to the functional category, and under the Lexical Category
Condition (LCC) (Truckenbrodt 1999) the interface constraints apply only to
lexical items, not to functional items. Having no corresponding phonological
phrase boundary on the right of the DP therefore does not violate any
interface condition, and these two candidates both satisfy Align-R.
However, because the AvP yi-king combines with the following verb tsau to
form a VP, Wrap-XP requires this VP to be contained in one phonological
phrase, and both candidates are eliminated because they violate Wrap-XP,
which is ranked higher. As for candidates (a), (c), and (e), the AvP yi-king
has no corresponding phonological phrase boundary on its right, so they
violate Align-R; but since all of them satisfy Wrap-XP, they come out even
unless other constraints are considered. Because its yi is left unparsed at
the prosodic phrasal level, candidate (10e) violates Exhaustivity and is
eliminated. As can be seen, each of the candidates violates the markedness
constraint *P-phrase at least once. Compared with (10a), (10c) violates
*P-phrase more seriously. Therefore, (10a), in which the whole string is
analyzed as one phonological phrase, stands out as the optimal form and the
winner. The non-recursivity constraint cannot be violated in Xiamen TS. In
the tableaux given below, candidates that violate Non-recursivity or
Exhaustivity, or that cause an unnecessary increase in the number of
phonological phrases, have already been eliminated.
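The evaluation in tableau (10) can be reproduced mechanically. The sketch below encodes the violation marks as read off the tableau, with the column assignment inferred from the discussion above (an assumption, since the printed tableau gives only the marks), and selects the winner by comparing violation profiles in ranking order. The names `TABLEAU`, `profile`, `evaluate`, and `fatal_constraint` are illustrative.

```python
RANKING = ["Exhaustivity", "Wrap-XP", "Align-R", "*P-phrase"]

# Violation marks per candidate; constraints not listed have no marks.
TABLEAU = {
    "a. (yi yi-king tsau)φ":       {"Align-R": 1, "*P-phrase": 1},
    "b. (yi yi-king)φ (tsau)φ":    {"Wrap-XP": 1, "*P-phrase": 2},
    "c. (yi)φ (yi-king tsau)φ":    {"Align-R": 1, "*P-phrase": 2},
    "d. (yi)φ (yi-king)φ (tsau)φ": {"Wrap-XP": 1, "*P-phrase": 3},
    "e. yi (yi-king tsau)φ":       {"Exhaustivity": 1, "Align-R": 1, "*P-phrase": 1},
}

def profile(marks, ranking):
    """Violation counts ordered from the highest-ranked constraint down."""
    return tuple(marks.get(constraint, 0) for constraint in ranking)

def evaluate(tableau, ranking):
    """The optimal candidate has the lexicographically smallest profile."""
    return min(tableau, key=lambda cand: profile(tableau[cand], ranking))

def fatal_constraint(loser, winner, tableau, ranking):
    """Highest-ranked constraint on which the loser does worse than the winner."""
    for constraint in ranking:
        l = tableau[loser].get(constraint, 0)
        w = tableau[winner].get(constraint, 0)
        if l != w:
            return constraint if l > w else None
    return None

best = evaluate(TABLEAU, RANKING)
print("winner:", best)
for cand in TABLEAU:
    if cand != best:
        print(cand, "is eliminated by", fatal_constraint(cand, best, TABLEAU, RANKING))
```

With the ranking of (10), the function returns candidate (a), and the fatal constraints coincide with the positions of the exclamation marks in the tableau. If Align-R is moved above Wrap-XP in `RANKING`, the same function returns candidate (b) instead, which illustrates how much of the analysis is carried by the ranking.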
The analysis of AvP as the adjunct of the sentence is given below.

(11) Syntactic structure: [[yi]DP [[tai-k’ai]AvP # tsau]VP = a]IP
     Prosodic structure:  (yi   tai-k’ai)φ   (tsau)φ
                           he   probably     go        ASP
                           ‘He has probably left.’

     [[yi]DP [tai-k’ai]AvP # [tsau]VP a]IP   Wrap-XP   Align-R
     a. (yi tai-k’ai)φ (tsau)φ
     b. (yi tai-k’ai tsau)φ                            *!

In candidate (11a), the right edge of the AvP tai-k’ai coincides with the
right edge of a phonological phrase, thus satisfying Align-R. Its yi is a DP,
the maximal projection of a functional element, which need not correspond to
the edge of a phonological phrase on its right and therefore does not violate
Align-R.
In (11b), the whole string contains the maximal projection of a functional
word and thus need not be wrapped within the domain of one phonological
phrase, and the right edge of the AvP tai-k’ai does not coincide with the
right edge of a phonological phrase. The result of this assessment is that
candidate (11a), with the whole string analyzed as two phonological phrases,
is the optimal form.
Thus, with the constraint ranking Wrap-XP >> Align-R, the Wrap-Align theory
can explain the difference between the sentential adjunct and the VP adjunct
in Xiamen in defining the domains of TS. However, if the XP within Wrap-XP
is not an adjunct, restrictions have to be placed on the types of this XP. In
other words, we need to exclude XPs that belong to an empty category as
defined by the LCC, functional XPs, and lexical phrases such as NP, VP, and
AP that are embedded in functional phrases.
Now let us apply the Match Theory to the case of the Xiamen TS domain
and see if it can explain the prosodic structure of Xiamen at the level of the
phonological phrase.
We first consider AvP as the VP adjunct in (12).

(12) a. Syntactic structure: [[yi]DP [[yi-king]AvP = tsau]VP = a]IP
     b. Prosodic structure:  yi   ((yi-king)φ   tsau)φ          Match (XP; φ)
                             he   already       go       ASP
                             ‘He has already left.’

Now, let us take a look at AvP as the adjunct of the sentence in (13).

(13) a. Syntactic structure: [[yi]DP [[tai-k’ai]AvP # tsau]VP = a]IP
     b. Prosodic structure:  yi   ((tai-k’ai)φ   tsau)φ         Match (XP; φ)
                             he   probably       go       ASP
                             ‘He has probably left.’

As far as the tone group is concerned, the AvP used to modify the VP differs
from that used to modify the sentence. In (12) and (13), (a) gives the
syntactic structure, while (b) gives the prosodic structure, a recursive
phonological phrase. Each lexical syntactic phrase is matched to a
phonological phrase, while the DP and IP are excluded from the Match
condition (XP; φ).
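The mapping from (12a) to (12b) can be made explicit with a short sketch of Match (XP; φ) restricted to lexical phrases. The encoding is an assumption made for illustration: a syntactic node is a triple of label, a flag saying whether the phrase is lexical, and a list of daughters, and a word is a plain string. Following the prosodic line in (12b), the aspect particle a is left out of the input. The function name `match_phrase` is hypothetical.

```python
def match_phrase(node):
    """Map a syntactic tree to a bracketed prosodic string.

    Every lexical XP is matched by a φ with the same edges; functional
    XPs (here DP and IP) are not matched and contribute no brackets.
    """
    if isinstance(node, str):          # a terminal word
        return node
    label, is_lexical, daughters = node
    inner = " ".join(match_phrase(d) for d in daughters)
    return "(" + inner + ")φ" if is_lexical else inner

# (12a) without the aspect particle: [[yi]DP [[yi-king]AvP tsau]VP]IP
vp_adjunct = ("IP", False, [
    ("DP", False, ["yi"]),
    ("VP", True, [("AvP", True, ["yi-king"]), "tsau"]),
])

# (13a), with the bracketing as printed in the text.
sentential_adjunct = ("IP", False, [
    ("DP", False, ["yi"]),
    ("VP", True, [("AvP", True, ["tai-k’ai"]), "tsau"]),
])

print(match_phrase(vp_adjunct))
print(match_phrase(sentential_adjunct))
```

Because the two inputs have the same bracketing, the sketch returns the same recursive shape for both, with yi outside any φ, which is the structure shown in (12b) and (13b).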
The data of Xiamen TS can only help define the right edge of phonological
phrases: if a monosyllabic word keeps its citation tone unchanged in TS, the
right edge of this word is the right edge of a phonological phrase. Xiamen TS
cannot, however, define the left edge of phonological phrases. Because Match
Theory requires that the phonological phrase and the syntactic phrase
correspond to each other on both the right and left edges, it differs in
essence from other theories such as the Wrap-Align theory and the Edge-based
theory in terms of defining phonological phrases. For instance, yi yi-king
tsau in (10), [[yi]DP [[yi-king]AvP = tsau]VP = a]IP ‘he has already left’, is
predicted by the Wrap-Align theory to be (yi yi-king tsau)φ, a single
phonological phrase. The same structure is analyzed by the Match Theory into
two phonological phrases, “yi ((yi-king)φ tsau)φ” in (12), with one of them
dominating the other, which is a recursive prosodic structure. Moreover, the
pronoun yi is not included within the domain of any phonological phrase.
Thus, the domain of Xiamen TS is a phonological phrase, the definition of
which needs to refer to the right edge of the syntactic phrase. To apply to
Xiamen TS, the Wrap-Align theory needs to rank Wrap-XP above Align-R as well
as to place restrictions on the types of XPs in the constraint Wrap-XP. Match
Theory faces a similar restriction in Xiamen on the XPs in its Match (XP; φ):
such XPs cannot be functional phrases, empty categories, or syntactic phrases
embedded in functional phrases.
It is true that Xiamen provides support for the proposal that the prosodic
structure can be derived with reference to only one edge of the syntactic
structure, the proposal that led to the Edge-based theory (Chen 1985, 1987;
Selkirk 1986). The Wrap-Align theory within the OT framework in fact only
furthers the idea behind the Edge-based theory, because both claim, in
essence, that the definition of a phonological phrase needs to refer to only
one edge (i.e., either right or left) of the syntactic phrase, with different
languages choosing different parameters to define their phonological phrases.
As noted above, however, Xiamen TS can help only with defining the right edge
of the phonological phrase, not the left. Since the Match Theory requires
that a phonological phrase and a syntactic phrase correspond to each other on
both the right and left edges, it differs fundamentally from the Wrap-Align
theory and the Edge-based theory with regard to the prediction of
phonological phrases.

10.3.3 Non-OT analysis of Xiamen TS

Chen’s TG formation

On the basis of functional relations, Chen proposes Tone Group Formation
(TG formation) for Xiamen TS, as seen in (14).

(14) TG Formation (Chen 1985, 1987):

Mark the right edge of every XP with #, except where XP is an adjunct
c-​commanding its head.

The TG formation in (14) not only points out that Xiamen TS depends on
functional categories, but also combines two different approaches, namely,
the end-based approach proposed by Selkirk (1986) and the relation-based
approach suggested by Kaisse (1985). According to Chen, three conditions
need to be taken into consideration in order to ascertain the domain of
Xiamen TS: edge condition, adjunct/argument dichotomy condition, and
c-command condition.
Since Reinhart (1981) discussed in detail the notion of c-​command, two
different definitions have been proposed: a) the preliminary definition given
by Reinhart, and b) the revised definition proposed by Chomsky (1986), given
respectively in (15a) and (15b).

(15) a. Preliminary definition:

   α c-​commands β iff
   every branching node dominating α dominates β.
b. Revised definition:
   α c-​commands β iff
   every maximal projection dominating α dominates β.

To distinguish these two different c-command definitions, (15a) is generally
called c-command, while (15b) is termed m-command. It should be noted that
the notion of c-command adopted by Chen is in fact Reinhart’s preliminary
definition of c-command.
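The two definitions in (15) differ only in which dominating nodes are inspected, as a short sketch makes clear. The tree below is a hypothetical configuration chosen for illustration, not one taken from the text: a sentential adjunct attached under IP, and a VP containing a VP-adjunct and a V' that groups the verb with its object. The definitions are implemented exactly as worded in (15), without the additional clauses (such as α not dominating β) that are often added elsewhere.

```python
class Node:
    def __init__(self, label, children=(), maximal=False):
        self.label = label
        self.children = list(children)
        self.maximal = maximal          # True for maximal projections (XP)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(node):
    """All nodes that dominate the given node, from the bottom up."""
    result = []
    while node.parent is not None:
        node = node.parent
        result.append(node)
    return result

def dominates(a, b):
    return a in ancestors(b)

def c_commands(a, b):
    """(15a): every branching node dominating a dominates b."""
    return all(dominates(n, b) for n in ancestors(a) if len(n.children) > 1)

def m_commands(a, b):
    """(15b): every maximal projection dominating a dominates b."""
    return all(dominates(n, b) for n in ancestors(a) if n.maximal)

# Hypothetical tree: [IP S-adjunct [VP VP-adjunct [V' V NP]]]
s_adjunct = Node("sentential adjunct", maximal=True)
vp_adjunct = Node("VP-adjunct", maximal=True)
verb = Node("V")
obj = Node("NP", maximal=True)
v_bar = Node("V'", [verb, obj])
vp = Node("VP", [vp_adjunct, v_bar], maximal=True)
ip = Node("IP", [s_adjunct, vp], maximal=True)

print("VP-adjunct c-commands V:        ", c_commands(vp_adjunct, verb))
print("sentential adjunct c-commands V:", c_commands(s_adjunct, verb))
print("V m-commands VP-adjunct:        ", m_commands(verb, vp_adjunct))
print("V m-commands sentential adjunct:", m_commands(verb, s_adjunct))
```

In this configuration both adjuncts c-command the verb, so c-command alone cannot tell them apart, whereas the verb m-commands only the adjunct inside its own maximal projection.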
However, as noticed by Chen himself, the TG formation in (14) fails
to explain why the adjunct within VP differs from a sentential adjunct. In
Xiamen, a VP-​adjunct cannot form its own TS domain; instead, together with
its following head it forms one domain, as seen in (16). A sentential adjunct,
on the other hand, must have its own domain, as seen in (17).

(16) a. Ting   sio-tsia   yi-king    tsau   a
        33     55-53 #    55-33 =    53     n
        Ting   miss       already    go     ASP
        ‘Miss Ting has already left.’
     b. Ting   sio-tsia   kuah-kin   tsiaq   png
        33     55-53 #    55-55 =    21      33
        Ting   miss       quickly    eat     meal
        ‘Miss Ting quickly ate her meal.’

(17) a. Ting   sio-tsia   tai-k’ai   tsau   a
        33     55-53 #    21-21 #    53     n
        Ting   miss       probably   go     ASP
        ‘Miss Ting has probably left.’
     b. Ting   sio-tsia   tai-k’ai   yi-king    tsau   a
        33     55-53 #    21-21 #    55-33 =    53     n
        Ting   miss       probably   already    go     ASP
        ‘Miss Ting has probably already left.’

By virtue of the TG formation in (14), if an adjunct c-commands its head,
a TG boundary ‘#’ cannot be inserted. According to the definition of
c-command, both yi-king ‘already’ in (16a) and tai-k’ai ‘probably’ in (17a)
c-command the closely following tsau ‘go’, but only the former forms one TG
with tsau, while the latter and the following tsau form two different TGs.
This fact shows that there is some problem within the TG formation in (14).

Domain-c-command approach to Xiamen TS

After Chen (1987), Chung (1989) conducted a different analysis. Following
Kaisse’s idea (1985), he considered the domain of TS an m-​command domain
with the K-​condition instead of functional relations. The general idea of
Kaisse’s hypothesis is seen in (18a) and her definition of domain c-​command
is given in (18b).

(18) a. K-​condition (Kaisse 1985):

For a rule to apply to a sequence of two words α and β
(i) α must domain-​c-​command β or
(ii) β must domain-​c-​command α.
b. Domain c-​command (Kaisse 1985):
In the structure [Xmax … x …], Xmax is defined as the domain
of x. Then X c-​commands any Y in its domain.

Kaisse’s domain-c-command definition is, in fact, a refurbished version of
that of m-command by Chomsky (1986). According to the K-condition in
(18a), the TS rule applies between α and β so long as they stand in a head-XP
relation, where the XP is neutral between argument and adjunct.
Chung’s analysis can resolve the contradiction between (16)
and (17) because the VP-adjunct’s position in the syntactic tree is different
from that of the sentential adjunct. The former is within the VP and is
m-commanded by the head of the VP, namely, the verb, as seen in (19). But
the latter is outside the VP, and, thus, not m-commanded by the verb, as seen
in (20).

(19) [Tree diagram: IP at the root; terminal nodes VP-adjunct, V, NP]
     (Note: V m-commands AP.)

(20) [Tree diagram: IP at the root; terminal nodes sentential-adjunct, V, NP]
     (Note: V does not m-command AP.)

Since yi-king ‘already’ in (16a) is a VP-adjunct m-commanded by the
verb, the TS rule applies. But tai-k’ai ‘probably’ in (17a) is a sentential
adjunct, which is not m-commanded by the verb, so the TS rule does not
apply. However, Chung’s analysis cannot explain cases in which the verb
and the preceding PP are divided into two different TGs in Xiamen, as seen
in (21).

(21) a. Ting   sio-tsia   ti   hak-hau   tsiaq   png
        33     55-53 #    21   3-33 #    21      33
        Ting   miss       at   school    eat     meal
        ‘Miss Ting eats her meal at school.’
     b. Ting   sio-tsia   kuah-kin   ti   hak-hau   tsiaq   png
        33     55-53 #    55-55 =    21   3-33 #    21      33
        Ting   miss       quickly    at   school    eat     meal
        ‘Miss Ting ate her meal quickly at school.’

In the syntactic tree, the PP ti hak-hau ‘at school’ is m-commanded by the
verb tsiaq ‘eat’, as seen in (22).

(22) [Tree diagram: IP at the root; terminal string ti hak-hau tsiaq png]


According to the K-condition, tsiaq m-commands hak-hau, so the TS rule
should apply between them. But as a matter of fact, this is a wrong TS output
for Xiamen. Therefore, for Xiamen TS, Kaisse’s hypothesis in (18), employed
by Chung, is not a successful one.

Revised TG formation

In order to solve the problem remaining after Chen (1987) and Chung (1989),
Chen (1992) revised the TG formation for Xiamen, as shown in (23).

(23) Revised TG Formation (Chen 1992):

Mark the right edge of every XP with #, except where
XP is an adjunct c-​commanding its lexical head.

Like the preliminary version in (14), the revised version in (23) takes
functional relations with the head, rather than m-command, to be the key to
Xiamen TS. Unlike (14), (23) restricts the exception to adjuncts that
c-command a lexical head, not just any head. Since a sentential adjunct is
licensed by I (Infl), which is the head of a functional category, its head is
non-lexical; thus, the TS rule must be blocked between a sentential adjunct
and its following elements, although the sentential adjunct c-commands them.
But the adjuncts within the VP and NP are different because both of them
modify lexical heads, and, thus, the TS rule must be applied between these
adjuncts and their heads. As for the cases in which the TS rule must be
blocked between the PP and the closely following verb, according to Chen
(1992), the NP (i.e., the XP between the P and the verb) is an argument
rather than an adjunct, although the PP is the adjunct of the verb; its right
edge is therefore marked with #, which blocks the TS rule, as seen in (24).

(24) VP


[P [NP]ARG # ]adjunct V NP
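
The marking procedure in (23) can be sketched as a small Python function. This is only an illustration under simplifying assumptions: the XP class, its two boolean fields, and the function name are hypothetical; the syntactic analysis (adjunct status, c-command of a lexical head) is annotated by hand rather than computed from a tree; and ‘=’ is used, as in the examples, for an edge across which the TS rule applies.

```python
from dataclasses import dataclass


@dataclass
class XP:
    """A phrase with hand-annotated syntactic properties (hypothetical helper)."""
    text: str
    is_adjunct: bool = False
    c_commands_lexical_head: bool = False


def mark_edges_revised(xps):
    """Apply (23): '#' at the right edge of every XP, unless the XP is an
    adjunct c-commanding its lexical head, in which case TS applies ('=')."""
    marked = []
    for xp in xps:
        exempt = xp.is_adjunct and xp.c_commands_lexical_head
        marked.append(f"{xp.text} {'=' if exempt else '#'}")
    return " ".join(marked)


# Adjunct modifying the lexical head V, as in 'yi-king = tsau'.
print(mark_edges_revised([XP("yi-king", True, True), XP("tsau")]))
# Sentential adjunct licensed by Infl, a non-lexical head, as in 'tai-k'ai # tsau'.
print(mark_edges_revised([XP("tai-k'ai", True, False), XP("tsau")]))
```

The ‘#’ printed after tsau simply reflects the right edge of the final XP.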

Thus, it can be seen that the revised version in (23) by Chen not only solves the
problem in (16) and (17) but also works out a solution for the problem in (21).

Re-revised TG formation

However, the hypothesis in (23) still contains some problems. First, let us con-
sider the examples from (25) to (30).

(25) a. tso tsit ts’ut liok-​ yah-​ p’ih lai k’uah

   33     3   5   =    3 - ​ 55 -​  21 #      33   21
   rent one   Cl  video-​movie    to   watch
   ‘Rent a video movie to watch’

b. liok-​yah-​p’ih tso tsit ts’ut lai k’uah

   3 -​ 55 -​ 21     # 33   3 5   = 33 21
video-​movie rent one Cl      to watch
‘Rent a video movie to watch’

(26)  a. bue tsap kuah be-​ a tsiu lai lim

   55    3   53 = 21-​55     53  #  33 55
buy ten Cl    beer   wine to   drink
‘Buy ten bottles of beer to drink’
b. be-​a tsiu   bue tsap kuah lai  lim
21-​55 53 # 55    3   53 = 33  55
beer wine buy   ten   Cl    to drink
‘Buy ten bottles of beer to drink’

(27) a. tso tsit ts’ut liok-​ yah-​p’ih tsin kui

     33 3   5 = 3 - 55 -​ 21 # 33 21
rent one Cl  video-​ movie very expensive
‘It is very expensive to rent a video movie.’
b. liok-​yah-​p’ih   tso tsit ts’ut tsin kui
     3 -​ 55 -​  21 #    33 3   3   # 33 21
video-​movie   rent one Cl very expensive
‘It is very expensive to rent a video movie.’

(28) a. lim tsap kuah be-​ a tsiu  e  tsui

  33    3   53 = 21-​55 53 # 21 21
drink ten   Cl   beer wine will drunk
‘To drink ten bottles of beer will cause drunkenness.’
b. be-​ a tsiu lim tsap kuah e tsui
55 53 # 33  3     21 # 21 21
    beer wine drink ten   Cl   will drunk
‘To drink ten bottles of beer will cause drunkenness.’

(29) ts’iuh sah pai siuh t’iam

53  33 53 # 33  53
sing three Cl too tired
‘It is too tiring to sing three times.’

(30) ts’iuh tsit pai hoo yi t’iah

53   3   55 = 44 22    44
sing   one Cl   for him hear
‘Sing once for him to hear.’

Chen (1992) analyzed case (25). In his view, the QP in (25a), an adnominal
adjunct of the object NP liok-yah-p’ih ‘video movie’, is reanalyzed in (25b)
as an adverbial phrase and a post-head adjunct as a result of the
topicalization of liok-yah-p’ih. The syntactic structure given by Chen for
(25b) is shown in (31).

(31) [Tree diagram with node labels S', Top, S, V', S'; terminal string:]
     liok-yah-p’ih # tso tsit ts’ut = lai k’uah

The first question we want to ask is how ‘=’, which is put at the right edge
of the QP to symbolize the application of the TS rule, is obtained. According
to the TG formation in (23), ‘#’ should be assigned to the right edge of
every XP, except when an XP is an adjunct c-commanding its lexical head, in
which case an ‘=’ is placed there instead. But in (31), the QP c-commands
only the verb tso ‘rent’ at its left without c-​commanding any elements to its
right. By Chen’s analysis, the QP seems to be an adjunct c-​commanding its left
head, thus gaining an ‘=’ at its right, although this QP does not have any c-​
command relation with its right elements. Such an analysis is also suitable for
example (26b). But this analysis violates the locality conditions (Poser 1981,
1985; Steriade 1987), which maintain that the application of the TS rule to
the right should have nothing to do with the syntactic condition to the left.
The second question is concerned with “lexical head”. According to
(23), the TS rule must be blocked between the XP and the following elem-
ents, except when the XP is an adjunct c-​commanding its lexical head. Before
discussing the problem involved in (23), let us briefly present Chinese phrase
structures first (Huang 1982, 1991; Tang 1990). In the notation of X’-​theory,
every phrasal category is a projection of a zero-​level category in terms of the
following formalization.6

(32) a. X’ = X X”*
b. X” = X”* X’

Zero-​level categories are assumed to be of two different types. One type

consists of the lexical categories, including N, V, P, and A. Another type
covers the non-​lexical or functional categories like complementizer (C) and
Infl (I). Now let us come back to the problem in (23). According to (23), an
adjunct can c-​command its lexical head, excluding a non-​lexical head or a
functional head, that is, Infl or Comp of CP (Chen 1992). But hoo ‘for’ in
(30) is the head of a functional category, that is, the Comp of CP, rather
than a lexical head, and yet the TS rule still applies between the QP tsit pai
‘one time’ and hoo. This shows that the TG formation in (23) needs further
revision. That is why I propose here in (33) a re-revised TG formation for
Xiamen.7

(33) Re-​revised TG Formation:

Mark the right edge of every XP with #, except where XP is an
adjunct m-​commanding either its head or the head of XP on the
right except Infl.

The TG formation in (33) can account for, without any exception, all of the
data mentioned above. Adjuncts in both example (16) and (17) m-​command
their following heads, but since the head of the former is a verb while that of
the latter is an Infl, the TS rule can be applied only to (16), and is blocked in
(17), as shown respectively in (34) and (35).

(34) [Tree diagram rooted in IP; terminal string:]
     yi-king = tsau
     (Note: AP m-commands V, i.e., the head of VP.)

(35) [Tree diagram rooted in IP; terminal string:]
     tai-k’ai # tsau
     (Note: AP m-commands Infl, i.e., the head of IP.)

In example (21a), the NP hak-​hau ‘school’ is an argument, not an adjunct,

for the preposition ti ‘at’, so the TS rule is blocked between hak-​hau and tsiaq.
The example is reproduced in (36) for the sake of convenience.

(36) [Tree diagram rooted in IP; terminal string:]
     ti hak-hau # tsiaq png
     (Note: The XP before V, i.e., the head of VP, is an argument.)

As for example (21b), since the adjunct kuah-​kin ‘quickly’ m-​commands

the head of PP, that is, ti ‘at’, on the right, the TS rule is applied between kuah-​
kin and ti, as seen in (37).

(37) [Tree diagram with node labels IP, AP1, V'; terminal string:]
     kuah-kin = ti hak-hau # tsiaq png
     (Note: AP2 m-commands P, the head of PP on the right.)

Now, let us consider the examples (25–​30) in accordance with the TG for-
mation in (33).
In both (25b) and (26b), the QPs, as adjuncts, m-​command the right head
lai ‘to’. Likewise, in example (30), the QP m-​commands hoo ‘for’, the head of
CP on the right. So the TS rule must be applied to (25b), (26b), and (30), in
which the heads following the QPs are all complementizers and are all heads
of CP. The syntactic structure of (30) can be redrawn as (38).

(38) [Tree diagram with node labels IP, Spec, I', Spec, I'; terminal string:]
     ts’iuh  tsit  pai  =  hoo   yi  t’iah
     sing    one   Cl      Comp  he  hear

As for examples (27b), (28b), and (29), their syntactic tree structures are the
same as illustrated in (39), in which the QP as an adjunct cannot m-​command
any of the elements on its right, thus blocking the TS rule.

(39) [Tree diagram with node labels IP, Spec, I'; terminal string:]
     ts’iuh sah pai # siuh t’iam

Therefore, it can be seen that the TG formation in (33) can account for all
of the data here.
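
As a rough check on the reasoning above, the decision made by (33) for a single XP can be written as a short Python function. This is a minimal sketch under stated assumptions: the function name, the single-letter category labels, and the hand-annotated list of cases are illustrative only, and the m-command relation is supplied as an annotation rather than derived from a tree.

```python
def edge_mark(is_adjunct, right_head):
    """Apply (33) to one XP.

    right_head is the category of the head that the XP m-commands (its own
    head or the head of the XP on its right), or None if there is no such head.
    """
    if is_adjunct and right_head is not None and right_head != "I":
        return "="   # the TS rule applies across this edge
    return "#"       # tone group boundary


# (XP, is_adjunct, m-commanded head), following the discussion of (34)-(39).
cases = [
    ("yi-king", True, "V"),    # (34)
    ("tai-k'ai", True, "I"),   # (35): the head is Infl
    ("hak-hau", False, "V"),   # (36): an argument, so '#' whatever head follows
    ("kuah-kin", True, "P"),   # (37)
    ("tsit pai", True, "C"),   # (38): hoo is the head of CP
    ("sah pai", True, None),   # (39): no m-commanded head on the right
]
for text, adjunct, head in cases:
    print(text, edge_mark(adjunct, head))
```

The only head category singled out is Infl; a complementizer counts like any other head, which is the point on which (33) departs from (23).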

If we compare the TG formations in (33) and (23), we will see such
differences between them as: (i) the syntactic condition of (33) is m-​command,
while the syntactic condition of (23) is c-​command; (ii) (33) is concerned only
with an adjunct’s m-​command relations to the right heads, while ignoring its
left elements (locality conditions are related to this point), but (23) sometimes
depends on the relation between an adjunct and its left head in order to decide
whether or not there is a boundary to the right of TG; and (iii) by (33), an
adjunct can m-​command all of the following heads, including C of CP, except
Infl, but by (23), an adjunct can c-​command only its lexical head, excluding
all non-​lexical heads or functional heads, that is, either Infl or Comp of CP.
One key point concerning (iii) is the fact that Infl in Chinese is a trace, that
is, one of the empty categories, in S-​structure. Based on the discussion of “A
not A” question sentences, Huang (1990) has proved that the AGR and verb
in Chinese move respectively downward from I° and upward from VP to “VP
shell”, which is located between I’ and VP. So, after head-​to-​head movement,
Infl, a head of IP, becomes a trace, as shown in (40).

(40) [Tree diagram with node labels IP, Spec, I', I, VP-shell, [AGR], Spec, V',
     Spec, V'; terminal string: t [e] t]

Thus, it can be seen that the definition in (33) differs from that in (23), in
that the former maintains that the TS rule is blocked by an empty category,
while the latter holds that it is blocked by functional words. However, the TS
rule is still applicable even if functional heads on the right are m-​commanded
by an adjunct, and this has been proved by lai ‘to’ in (25b) and (26b) as well
as hoo ‘for’ in (30).

10.4 Pingyao Chinese: Case study (2)

10.4.1 Tonological background

Spoken in the central part of Shanxi province in north China, Pingyao
belongs to the Jin dialect and has five citation tones (Hou 1980), as given
in (41).

(41) The Citation Tones in Pingyao Chinese

     Tonal Category     Phonetic Value   Examples
     1. Ping Tone       LM               iŋ    “overcast”
     2. Shang Tone      HM               ɕiɔ   “small”
     3. Qu Tone         MH               ts’æ  “dish”
     4. Yin Ru Tone     LMq              ʂʌʔ   “lose”
     5. Yang Ru Tone    HMq              yʌʔ   “moon”

In connected speech, Pingyao TS is divided into two types. The tonal

sequences that emerge depend upon both the combination of citation tones
(CT) and the functional relations that hold between tone-​bearing units across
the sandhi site. All of the argument structures belong to type A (TSA), while
all of the others fall under type B (TSB). Summaries of the disyllabic tonal
patterns of TSA and TSB are given in (42) and (43), respectively.

(42) Disyllabic TSA Patterns of Pingyao

     T1 / T2      LM        LMq        MH        HM        HMq

(43) Disyllabic TSB Patterns of Pingyao

     T1 / T2      LM        LMq        MH        HM        HMq
     LMb (yin)    ML-MH     ML-MHq     ML-LM     ML-HM     ML-HMq
     LMq          MLq-MH    MLq-MHq    LMq-LM    MLq-HM    MLq-HMq
     HMq          HMq-LM    HMq-LMq    HMq-MH    MHq-HM    HMq-HMq

In the above two tables, the leftmost column and the top row show the
form of the citation tones of the first and the second syllable, respectively. The
intersections of the columns and the rows indicate the sandhi tone forms of
bi-​tonal sequences.
The tones of LMq and HMq can be considered as the allotones of LM and
HM, respectively, because they have the same TS patterns. Thus, the patterns
of TSA can be simplified as (44).


(44) T1 / T2      LM        MH        HM

As for TSB, the tonal behavior of LM in sandhi position is divided into

two types that indicate the two different historical sources of the citation tone
LM. One is the yin ping tone, and the other the yang ping tone.9 The contrast
between yin ping and yang ping is lost in the citation tone system, where the
two have merged, but is preserved at the sandhi level. In addition, the sandhi
forms of LMq are
realized as falling tones just like its counterpart LM of yin ping tones except
for those marked in the shaded cell, which remain as rising tones. So, the
patterns of TSB in table (43) can be simplified as (45).


(45) T1 / T2       LM       MH              HM
     LMa (yang)    LM-LM    ML-MH           MH-MLM
     LMb (yin)     ML-MH    ML-LM/LMq-LM    ML-HM

As shown by the data, Pingyao exhibits a very complicated case of TS
patterns. The mode of rules for TSA is a regressive one, and the rules of TSA
are proposed as follows.

(46) Regressive Rules for TSA:

a. LM → ML /​_​_​_​_​MH
b. LM → MH /​_​_​_​_​MLM (HM)10
c. MH → LM /​_​_​_​_​LM
d. MH → ML /​_​_​_​_​MH
e. HM → MH /​_​_​_​_​HM
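
The regressive rules in (46) amount to a lookup on the pair of citation tones, with only the first tone changing. The following Python sketch is one way to make that explicit; the names TSA_RULES, split_q, and apply_tsa are hypothetical, the checked tones LMq and HMq are treated as allotones of LM and HM as the text suggests, and the surface realization of HM as MLM mentioned in (46b) is not modeled.

```python
# Regressive TSA rules from (46): (first tone, second tone) -> new first tone.
TSA_RULES = {
    ("LM", "MH"): "ML",   # (46a)
    ("LM", "HM"): "MH",   # (46b); the source writes the trigger as MLM (HM)
    ("MH", "LM"): "LM",   # (46c)
    ("MH", "MH"): "ML",   # (46d)
    ("HM", "HM"): "MH",   # (46e)
}


def split_q(tone):
    """Separate the checked-tone marker q from the contour: 'LMq' -> ('LM', 'q')."""
    return (tone[:-1], "q") if tone.endswith("q") else (tone, "")


def apply_tsa(t1, t2):
    """Return the sandhi forms of a disyllabic sequence under TSA."""
    contour1, q1 = split_q(t1)
    contour2, _ = split_q(t2)
    new_contour1 = TSA_RULES.get((contour1, contour2), contour1)
    return new_contour1 + q1, t2   # the second syllable keeps its tone


print(apply_tsa("LM", "MH"))    # first tone LM before MH
print(apply_tsa("MH", "LMq"))   # first tone MH before a checked LM
```

Pairs with no entry in the table are returned unchanged, on the assumption that no rule in (46) targets them.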

The rules of TSB are more complicated. Besides regressive rules, progres-
sive rules and bidirectional rules will also be applied, as shown below.

(47) a. Regressive Rules for TSB:

i. LMa → ML /​_​_​_​_​MH
ii. LMa → MH /____MLM (HM)
iii. LMb → ML /​_​_​_​_​HM
b. Progressive Rules for TSB:
i. LM → HM /​MH _​_​_​_​
ii. MH → HM /​MH _​_​_​_​
c. Bi-​directional Rules for TSB:
i. LMb-​LM → ML-​MH
ii. LMb-​MH → ML-​LM
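
Because the TSB rules in (47) may change the first tone, the second tone, or both, a table mapping a pair of citation tones to a pair of sandhi tones is a convenient way to state them. The Python sketch below does this for the non-checked tones only; the names TSB_RULES and apply_tsb are hypothetical, the checked tones and the shaded cell of (43) are left out, and the output MH-MLM for an LMa-HM input follows the simplified table in (45).

```python
# TSB rules from (47): (first tone, second tone) -> (sandhi 1, sandhi 2).
# LMa is yang ping and LMb is yin ping; the two merge only in citation form.
TSB_RULES = {
    ("LMa", "MH"): ("ML", "MH"),    # (47a-i), regressive
    ("LMa", "HM"): ("MH", "MLM"),   # (47a-ii), regressive; HM surfaces as MLM
    ("LMb", "HM"): ("ML", "HM"),    # (47a-iii), regressive
    ("MH", "LM"): ("MH", "HM"),     # (47b-i), progressive
    ("MH", "MH"): ("MH", "HM"),     # (47b-ii), progressive
    ("LMb", "LM"): ("ML", "MH"),    # (47c-i), bidirectional
    ("LMb", "MH"): ("ML", "LM"),    # (47c-ii), bidirectional
}


def apply_tsb(t1, t2):
    """Return the sandhi forms of a disyllabic sequence under TSB."""
    if t1 == "LM":
        raise ValueError("TSB needs the source of LM: use 'LMa' or 'LMb'")
    base2 = t2.rstrip("ab")              # yin/yang is irrelevant in second position
    unchanged = (t1.rstrip("ab"), base2)
    return TSB_RULES.get((t1, base2), unchanged)


# Reproduce the two rows of the simplified table in (45).
for first in ("LMa", "LMb"):
    print(first, [apply_tsb(first, second) for second in ("LM", "MH", "HM")])
```

Pairs that no rule in (47) mentions are returned in their citation forms, which is an assumption of this sketch rather than a claim of the analysis.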

This argument (TSA) versus non-argument (TSB) dichotomy in TS patterns can be
summarized and illustrated as (48) and (49), respectively.


(48) Citation tone    TSA          TSB
     LM(q)-MH         ML(q)-MH     LM(q)-LM
     MH-LM(q)         LM-LM(q)     MH-HM(q)

(49)                     耕地                 豇豆
                         ‘till soil’          ‘cowpea’
     Functional type     argument             non-argument
     Syntactic type      verb-object (VO)     modifier-noun (MH)
     Tone sandhi type    type A (TSA)         type B (TSB)
     Citation tone       LM-MH                LM-MH
     Sandhi tone         ML-MH                LM-LM

The examples in (49) show that the citation tones for type A and type B are
exactly the same, but the sandhi tones are different because of the difference
in functional relations.
This fact becomes even more intriguing when we consider the effect of TS
on more complex structures exhibiting hierarchical structure and allowing for
possible interaction between TSA and TSB. Some examples show that the
internal structure is visible for rule application, since TS rules apply cyclically,
and the rule selection (TSA or TSB) depends on the functional relation that
holds for each cycle, seen as follows:


(50) plant tree festival
     ‘Arbor Day’
     BT             LMq - MH - LMq
     i.   cyclic    [ MLq ] by TSA
                    [ HMq ] by TSB
          ok        MLq - MH - HMq
     ii.  L → R     [ MLq ] by TSA
                    [ LM ] by TSA
          *         MLq - LM - HMq
     iii. R → L     [ HMq ] by TSB
                    [ LM ] by TSB
          *         LMq - LM - HMq

As shown in (50), only the cyclic mode yields the correct output form. In the
derivations above, the labeled brackets […]A and […]B stand for functional
units of type A or type B, which select TSA or TSB, respectively, on each
cycle. Some other examples, however, suggest a non-cyclic mode, as seen in (51).

(51) a.

     ‘the journey is long’
     i.   cyclic    [ HM ] by TSB
                    [ NA ] by TSA
          *         MH - HM - LM
     ii.  R → L     [ LM ] by TSA
                    [ LM ] by TSA
          ok        LM - LM - LM

     b. very make money
     ‘very lucrative’
     i.   cyclic    [ LM ] by TSA
                    [ NA ] by TSB
          *         HM - LM - LM
     ii.  L → R     [ NA ] by TSB
                    [ HM ] by TSB
          ok        HM - MH - HM

Apparently, in the cases of (51a) and (51b), the functional information for
internal structures is ignored. Instead, TS rules apply iteratively, with the
functional relation holding on the outer structure determining both the
applicable rule (TSA or TSB) and the direction of application (right to left or
left to right). Without going into the details, the overall patterns of Pingyao
TS can be laid out as follows.

(52) Type A Left-branching

(A1) A (A2) A


x1 x2 x3 x1 x2 x3
--A-- --A--

--A-- --A--
(A3) A (A4) A


x1 x2 x3 x1 x2 x3
--A-- --B--
--A-- --A--

Type B Left-branching
(B1) B (B2) B


x1 x2 x3 x1 x2 x3
--A-- --B--
--B-- --B--
(B3) B (B4) B


x1 x2 x3 x1 x2 x3
--B-- --B--
--B-- --B--

The figures in (52) exhaust all logical possibilities: right/left-branching
structures, and A- or B-type grammatical constructions on the inner/outer
cycle. The trees represent the IC hierarchy in the usual manner, with node
labels A/B indicating the argument structure types (argument/others), and x’s
standing for the syllables. --A-- and --B-- indicate which TS applies to which
pair of adjacent syllables.

10.4.2 OT analysis of Pingyao TS

OT analysis of disyllabic TS in Pingyao

Zhang (1999) examined the case of Pingyao TS and gave an OT analysis of both
TSA and TSB. The constraints he proposed for disyllabic TSA of Pingyao are
given in (53).

(53) Constraints for TSA of Pingyao:

a. Pres(σ2, T): Preserve the tonal property of the second syllable;
b. Pres(σ1, C): Preserve the tonal contour of the first syllable;
c. Pres(σ1, R): Preserve the tonal register of the first syllable;
d. Word Final Rise: There must be a pitch rise word finally;
e. Pres(HM): Preserve the property of a base high falling tone in the
sandhi form;
f. Num(Inf) ≤ 2: A word with 2 syllables can carry at most two tonal
inflection points;
g. Num(Inf) ≥ 1: A word with two syllables should have at least one
tonal inflection point;
h. Dur(B): A pitch rise or a sharp pitch fall is disallowed;
i. Reg(2) H: No adjacent high registers.

The ranking of the constraints for TSA is summarized in (54).

(54) Pres(σ2, T), WFR, Num(Inf) ≤ 2, Num(Inf) ≥ 1
     >> Reg(2)H
     >> Dur(B), Pres(σ1, C)
     >> Pres(σ1, R)

Some of the constraints for TSA also apply to TSB, namely Num(Inf) ≤ 2,
Num(Inf) ≥ 1, Dur(B), Pres(σ1, C), Pres(σ1, R), and Pres(HM). The constraints
specific to TSB are given in (55).

(55) Constraints for TSB of Pingyao:

a. Yin/​Yang preservation: In sandhi forms, yin tones are falling and
yang tones are rising;
b. Num(Inf) ≤ 1: A word cannot have more than one tonal inflection
point;
c. Pres(σ2, C): Preserve the tonal contour of the second syllable;
d. Pres(σ2, R): Preserve the tonal register of the second syllable;
e. Reg(2)L: Two adjacent low registers are disallowed;
f. *Non-Lexical Tone: A tone not in the lexical inventory is not
allowed in the sandhi form.

The ranking of the constraints for TSB in Pingyao is shown in (56).

(56) Yin/Yang preservation, Num(Inf) ≤ 2, Num(Inf) ≥ 1
     >> Pres(σ1, C), Pres(σ1, R), Pres(HM)
     >> Num(Inf) ≤ 1, Reg(2)L
     >> Dur(B), *Non-Lexical Tone, Pres(σ2, C)
     >> Pres(σ2, R)

However, Zhang’s constraint-based analysis has several problems. First, some
of the constraints adopted are language-specific and thus not compatible with
OT, in which constraints are universal. Such cases include Pres(HM), Dur(B),
and yin/yang preservation, which are peculiar to the Pingyao case and no better
than ad hoc stipulations. Second, Zhang’s analysis enjoys too much freedom. For
example, he treats the data in the shaded cell in table (57) as an anomaly and
leaves it without explanation.


(57) T1 / T2      LM        LMq        MH        HM        HMq
     LMb (yin)    ML-MH     ML-MHq     ML-LM     ML-HM     ML-HMq
     LMq          MLq-MH    MLq-MHq    LMq-LM    MLq-HM    MLq-HMq
     HMq          HMq-LM    HMq-LMq    HMq-MH    MHq-HM    HMq-HMq

Third, because yang-ping LM in TSB has the same TS behavior as in TSA, he
regards it as an idiosyncrasy of Pingyao. However, if the OT framework is to
capture bi-tonal sandhi in Pingyao satisfactorily, the fact that yang-ping has
the same sandhi behavior in TSA and TSB should not be ignored and must be
accounted for. Once yang-ping in TSB is taken into consideration, the
constraint ranking proposed for TSB fails to predict all the attested sandhi
forms.

OT analysis of tri-syllabic TS in Pingyao

In tri-tonal sequences, the TS domain (TSD) in Pingyao undergoes
restructuring. Let us first use OT constraints to delimit the TSD and account
for the modes of rule application (i.e., cyclic mode versus iterative mode).
The TSD in Pingyao is disyllabic, and therefore, we can posit the binarity
constraint in (58).

(58) Binary: TSD must be binary under syllabic analysis.

Under this constraint, the three syllables will be parsed as either (σσ)σ or
σ(σσ). In order to prevent such partially unparsed structures from being
chosen, a constraint that demands that every syllable in the input be parsed
into a TSD is needed, as shown in (59).

(59) Parse σ: Every syllable should be parsed into TSD.

If the Parse constraint is ranked higher than the Binary constraint, the
unparsed structures are ruled out.
Chen (1990, 2000) discussed the directionality of TS rules for type A and
type B constructions: TS scans construction A right to left and scans
construction B left to right. If we redefine constructions A and B as
corresponding to the phonological phrase and the prosodic word, respectively,
the directionality of TS in Pingyao can be restated: the TS rule scans a
phonological phrase from right to left and a prosodic word from left to right.
The alignment constraints can then be proposed under the OT framework, as
stated in (60) and (61), respectively.

(60) Align (TSD, φ’)R: The right edge of every TS domain is aligned with
the right edge of the maximal phonological phrase.

(61) Align (TSD, ω’)L: The left edge of every TS domain is aligned with
the left edge of the maximal prosodic word.

Following Ito and Mester (2012), we can refer to the larger structure of
the tri-​syllabic string as the maximal prosodic category. It should be noted
that these two alignment constraints are not dominated in the prosodic hier-
archy, and consequently, the ranking of constraints for the tri-​tonal sandhi in
Pingyao is as follows.

(62) Align (TSD, ω’)L /​Align (TSD, φ’)R >> Parse σ >> Binary

The constraint ranking in (62) can predict the domain of tri-syllabic TS,
as illustrated in (63) and (64), respectively.


(63)  [σ σ σ]φ’      Align (TSD, φ’)R    Parse σ    Binary
      σ (σ σ)                            *!
      (σ σ) σ        *!                  *
      (σ) (σ σ)      *!                             *
      (σ σ) (σ)      *!                             *
   ☞  (σ (σ σ))                                     *
      ((σ σ) σ)      *!                             *

(64)  [σ σ σ]ω’      Align (TSD, ω’)L    Parse σ    Binary
      σ (σ σ)        *!                  *
      (σ σ) σ                            *!
      (σ) (σ σ)      *!                             *
      (σ σ) (σ)      *!                             *
      (σ (σ σ))      *!                             *
   ☞  ((σ σ) σ)                                     *

In the above tableaux, the two alignments make different decisions on

parsing the TSD. If the input of tri-​syllabic TS is a prosodic word, the TSD
will be ((σ σ) σ), while if the tri-​syllabic TS input is a phonological phrase, the
TSD will be (σ (σ σ)).
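
The evaluation in these tableaux can be replayed mechanically. The Python sketch below is an illustration under stated assumptions: each candidate is encoded by hand as a list of domains over three syllable positions, the function names are hypothetical, each constraint simply counts violations, and strict ranking is modeled by comparing violation profiles from the highest-ranked constraint down.

```python
N = 3  # number of syllables in the input string

# Each candidate is a list of TS domains written as (start, end) over syllable
# positions 0..3, so (1, 3) covers the second and third syllables.
CANDIDATES = {
    "σ(σσ)": [(1, 3)],
    "(σσ)σ": [(0, 2)],
    "(σ)(σσ)": [(0, 1), (1, 3)],
    "(σσ)(σ)": [(0, 2), (2, 3)],
    "(σ(σσ))": [(0, 3), (1, 3)],
    "((σσ)σ)": [(0, 3), (0, 2)],
}


def align_r(domains):
    """One violation per domain whose right edge is not the right edge of the string."""
    return sum(1 for _, end in domains if end != N)


def align_l(domains):
    """One violation per domain whose left edge is not the left edge of the string."""
    return sum(1 for start, _ in domains if start != 0)


def parse(domains):
    """One violation per syllable that belongs to no domain."""
    covered = {i for start, end in domains for i in range(start, end)}
    return N - len(covered)


def binary(domains):
    """One violation per domain that is not exactly two syllables long."""
    return sum(1 for start, end in domains if end - start != 2)


def winner(ranking):
    """Pick the candidate with the best violation profile under a strict ranking."""
    return min(CANDIDATES, key=lambda name: tuple(c(CANDIDATES[name]) for c in ranking))


print("phonological phrase:", winner([align_r, parse, binary]))
print("prosodic word:", winner([align_l, parse, binary]))
```

Swapping the alignment constraint is the only difference between the two evaluations, which mirrors the claim that the two alignments alone decide between (σ (σ σ)) and ((σ σ) σ).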
A tri-syllabic string forms two TSDs. If the internal TSD formed by two
syllables is congruent with the intermediate prosodic category, it is that
prosodic category (i.e., phonological phrase or prosodic word) that determines
which type of TS rule is chosen. Otherwise, it is the maximal prosodic
category that does so. Tableaux (65)–(68) illustrate the formation of TSD for
the cases of A1, A4, B2, and B4 in (52).

(65) A1: [[σ σ]φ σ]φ’ → (σ (σ σ))φ’

      [[σ σ]φ σ]φ’     Align (TSD, φ’)R    Parse σ    Binary
      σ (σ σ)                              *!
      (σ σ)φ σ         *!                  *
      (σ) (σ σ)        *!                             *
      (σ σ)φ (σ)       *!                             *
   ☞  (σ (σ σ))φ’                                     *
      ((σ σ)φ σ)φ’     *!                             *

(66) A4: [σ [σ σ]ω]φ’ → (σ (σ σ)ω)φ’

      [σ [σ σ]ω]φ’     Align (TSD, φ’)R    Parse σ    Binary
      σ (σ σ)ω                             *!
      (σ σ) σ          *!                  *
      (σ) (σ σ)ω       *!                             *
      (σ σ) (σ)        *!                             *
   ☞  (σ (σ σ)ω)φ’                                    *
      ((σ σ) σ)φ’      *!                             *

(67) B2: [[σ σ]ω σ]ω’ → ((σ σ)ω σ)ω’

      [[σ σ]ω σ]ω’     Align (TSD, ω’)L    Parse σ    Binary
      σ (σ σ)          *!                  *
      (σ σ)ω σ                             *!
      (σ) (σ σ)        *!                             *
      (σ σ)ω (σ)       *!                             *
      (σ (σ σ))ω’      *!                             *
   ☞  ((σ σ)ω σ)ω’                                    *

(68) B4: [σ [σ σ]ω]ω’ → ((σ σ) σ)ω’

      [σ [σ σ]ω]ω’     Align (TSD, ω’)L    Parse σ    Binary
      σ (σ σ)ω         *!                  *
      (σ σ) σ                              *!
      (σ) (σ σ)ω       *!                             *
      (σ σ) (σ)        *!                             *
      (σ (σ σ)ω)ω’     *!                             *
   ☞  ((σ σ) σ)ω’                                     *

The constraints and recursive prosodic structures proposed here can predict
the TSD and, through restructuring, account for all eight TS patterns listed
in (52). However, there are two problems with this analysis. The first
concerns the nature of the alignment. Generally speaking, the term “alignment”
refers to a correspondence between different domains, that is, between a
morphosyntactic category and a prosodic category. But if Align-L (i.e., the
domain of TS; maximal prosodic word) as adopted in the analysis treats the
domain of TS as a prosodic unit, the alignment constraint here is a
correspondence between prosodic units only, rather than between
morphosyntactic units and prosodic units.
The second problem is the differing TS behavior of the embedded disyllabic
units in the tri-syllables. Across the eight tri-syllabic patterns, the
embedded disyllabic unit does not behave uniformly. In some patterns it is a
prosodic word and undergoes TSB; in some it is a phonological phrase and
undergoes TSA; and in others it corresponds to no prosodic unit, so that the
choice of TS rule is decided by the outer maximal prosodic unit. This makes it
difficult to define the embedded disyllabic unit as a consistent unit in the
prosodic hierarchy. So, the OT approach apparently fails to capture the TS
patterns in Pingyao.

Non-OT analysis of Pingyao TS

As mentioned previously, Type A and Type B constructions in Pingyao
Chinese are different types of functional structures, determined by functional
categories. More precisely, it is syntactic factors involving functional
relations (argument structure versus non-argument structure) that determine
the application of TS rules. Hence, I propose here a new hypothesis for
Pingyao TS, the Edge C-command Principle, as given below.

(69) Edge C-​command Principle:

Within argument structure, TSA applies iteratively right to left
if X3 c-​commands both X2 and X1; and in non-​argument structure
where X1 c-​commands both X2 and X3, TSB applies iteratively
left to right. Otherwise, TSA/​B applies cyclically.

Since the rule of TSA applies right to left, it takes the rightmost element
X3 as the dominant element, which then determines the mode of rule appli-
cation by virtue of the c-​command condition; TSB works in the same way
as TSA, but in a different direction. As seen from the principle in (69), in
Pingyao a functional relation determines the type of TS rule (TSA versus
TSB), while a syntactic condition (c-​command) determines the mode of TS
rule application. Now let us use the principle in (69) to test all of the patterns
illustrated in (52).
In both (A1) and (A2) of (52), TSA applies iteratively right to left because
X3 c-​commands both X2 and X1, illustrated by (70a). In (A3) and (A4), since
X3 does not c-​command X1, TSA and TSB apply cyclically, seen as (70b). In
(B1) and (B2), TSA/​B applies cyclically because X1 does not c-​command X3,
as shown in (70c). In (B3) and (B4), since X1 c-​commands both X2 and X3,
TSB applies iteratively left to right, as presented in (70d).
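
The choice of mode made by (69) can be sketched in Python as follows. This is only an illustration: the nested-tuple encoding of the bracketing, the function names, and the returned labels are assumptions, and c-command is computed in the simplest way, with a syllable c-commanding whatever its sister dominates.

```python
def leaves(tree):
    """Return the syllables dominated by a node (a string is a single syllable)."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def c_commands(tree, a, b):
    """True if syllable a c-commands syllable b, i.e. a sister of a dominates b."""
    if isinstance(tree, str):
        return False
    if a in tree:  # a is an immediate daughter of this node
        return any(b in leaves(sister) for sister in tree if sister != a)
    return any(c_commands(child, a, b) for child in tree)


def ts_mode(tree, outer_type):
    """Choose the mode of rule application according to (69)."""
    if outer_type == "A" and c_commands(tree, "x3", "x2") and c_commands(tree, "x3", "x1"):
        return "TSA iterative right-to-left"
    if outer_type == "B" and c_commands(tree, "x1", "x2") and c_commands(tree, "x1", "x3"):
        return "TSB iterative left-to-right"
    return "cyclic"


left_branching = (("x1", "x2"), "x3")    # as in (A1), (A2), (B1), (B2)
right_branching = ("x1", ("x2", "x3"))   # as in (A3), (A4), (B3), (B4)

for tree in (left_branching, right_branching):
    for outer in ("A", "B"):
        print(tree, outer, "->", ts_mode(tree, outer))
```

Only the outer functional type and the bracketing enter the decision; which rule applies on each cycle in the cyclic cases is left to the inner and outer labels, as in (52).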

(70) a. = (A2)

journey long

‘the journey is long’

X1 - X2 - X3
[ LM ] by TSA
[ LM ] by TSA

ok LM - LM - LM (iterative right to left)

b. = (A4)

move bed-roll
‘to move bed-roll’
X1 - X2 - X3
[ LM ] by TSB
[ NA ] by TSA

ok LM - LM - LM (cycle)

c. = (B1)

plant tree festival

‘Arbor Day’
X1 - X2 - X3
BT LMq - MH - LMq
[ MLq ] by TSA
[ HMq ] by TSB

ok MLq - MH - HMq (cycle)

d. = (B3)

very make money

‘very lucrative’
X1 - X2 - X3
[ NA ] by TSB
[ HM ] by TSB

ok HM - MH - HM (iterative left to right)

The principle in (69) can explain all of the cases in (52), which shows that
Pingyao TS is governed by a typical functional/syntactic condition, rather
than by a foot condition, as claimed by Chen (1990), or by OT constraints, as
proposed by Zhang (1999).

10.5 Discussion
The domain of rule application of Chinese TS has been a major topic in
studies on the interface between syntax and phonology. With the birth of the
OT framework, phonological study seems to have split into two opposing
paradigms, that is, rule-based phonology versus constraint-based phonology.
Likewise, the study of the syntax-phonology interface has split into
two opposing paradigms, that is, the direct reference approach (DRA) and
the indirect reference approach (IRA). But these two oppositions are not the
same in nature. The former is caused by a different understanding about the
ontology, that is, how to interpret the nature of phonology. In other words, the
question here is whether the phonological process is a derivational process or a
constraint-​ranking process. As for the latter, it reflects the controversy over such
issues as whether syntactic information is accessible to phonological processes,
what syntactic properties are relevant to phonology, whether phonological rule
application refers to syntactic information directly or indirectly, whether syntax
is sensitive or insensitive/​blind in determining the domain of phonological rule
application, and so forth. Interface studies (including Alignment Theory, Wrap
Theory, Match Theory, etc.) under the OT framework belong essentially to IRA,
although, in the later stage, they did try to incorporate syntactic information
into the phonological process. Match theory even tried to establish a kind of
direct correspondence between syntactic units and prosodic categories. But such
efforts have not changed the fact that DRA and IRA essentially oppose each
other. Strictly speaking, DRA and IRA are not appropriate as technical terms
because they can be misleading, given that the essential
difference between these two approaches is not whether the phonological pro-
cess is directly or indirectly sensitive to the syntactic information. Both of them,
as a matter of fact, need to have the phonological process directly sensitive to
the syntactic information. The major difference between them actually lies in
(i) whether a prosodic structure is required, and (ii) what syntactic information is
needed during the phonological rule application. Syntactic information needs to
include at least such things as syntactic units (i.e., morpheme, affix, stem, word,
phrase, etc.), syntactic categories (i.e., NP, VP, AP, PP, etc.), syntactic relations
(i.e., c-​command, m-​command, binding, etc.), and so on, but IRA actually cares
about only such syntactic information as syntactic units and syntactic categories,
while DRA is concerned with syntactic relations. This choice of different kinds
of information is what really and most importantly contrasts DRA with IRA.
As for interface study within the OT framework, it is very much like IRA in
terms of caring only about syntactic units and syntactic categories while ignoring
syntactic relations. Therefore, the issues that IRA is faced with are exactly the
same issues that interface studies conducted under the OT framework are unable
to solve. Take the cases of Xiamen and Pingyao, for example. The data of Xiamen
TS can only help define the right edge of phonological phrases, which means that
if a monosyllabic word keeps the form of its citation tone unchanged in TS, the
right edge of this word will be the right edge of the phonological phrase. However,
Xiamen TS cannot define the left edge of phonological phrases. But Match
Theory requires that the phonological phrase and syntactic phrase correspond to
each other on both right and left edges. For instance, yi yi-​king tsau ‘he has already
left’ [[yi]DP [[yi-​king]AvP = tsau]VP = a]IP is predicted by the Wrap-​Align Theory as
one prosodic unit, that is (yi yi-​king tsau)φ, which is a phonological phrase. But
the same structure gets analyzed by Match Theory into two phonological phrases
“yi ((yi-​king)φ tsau)φ”, with one of them dominating the other, which is a recursive
prosodic structure. Moreover, the pronoun yi is not considered within the domain
of any phonological phrases. Thus, it can be seen that the domain of Xiamen TS
is a phonological phrase, the defining of which needs to refer to the right edge of
the syntactic phrase plus m-​command condition. Both the Wrap-​Align Theory
and Match Theory need to rank Wrap-​XP before Align-​R as well as to place
restrictions on the types of XPs in the constraint of Wrap-​XP/​(XP; φ).
The OT approach also fails in the Pingyao case. Pingyao has two TS rule
types based on the difference in morphosyntactic constructions: type A (TSA)
and type B (TSB). In tri-tonal sequences, the TS domain (TSD) in Pingyao
undergoes restructuring. Let us first use OT constraints to delimit the sandhi
domain and account for the modes of rule application (i.e., cyclic mode
versus iterative mode). Under OT constraints, the three syllables will be parsed
as either (σσ)σ or σ(σσ), and in order to prevent such partially unparsed
structures from being chosen, a constraint that demands that every syllable in
the input be parsed into a sandhi domain is needed. Chen (2000) discussed the
directionality of TS rules for Type A and Type B constructions: TS scans
construction A right to left and scans construction B left to right. If we
redefine constructions A and B as corresponding to a phonological phrase and a
prosodic word, respectively, the directionality of TS in Pingyao can be
restated: the TS rule scans a phonological phrase from right to left and a
prosodic word from left to right. Following Ito and Mester (2012), the larger structure
of a tri-​syllabic string can be termed the maximal prosodic category. However,
it should be noted that the alignment constraints in Pingyao are not dominated
in the prosodic hierarchy, and consequently, the ranking of constraints for the
tri-​tonal sandhi in Pingyao is the following: [Align (TSD, ω’)L /​Align (TSD,
φ’)R >> Parse σ >> Binary]. And the two alignments make different decisions
on parsing the sandhi domain. If the input of tri-​syllabic TS is a prosodic
word, the sandhi domain will be ((σ σ) σ), while if the tri-​syllabic TS input is a
phonological phrase, the sandhi domain will be (σ (σ σ)). A tri-​syllabic string
will form two TSDs. If the internal TS domain formed by two syllables is con-
gruent with the intermediate prosodic category, it is the prosodic category that
determines what type of TS rule will be chosen (i.e., phonological phrase or
prosodic word). Otherwise, it is a maximal prosodic category. Although the
constraints and recursive prosodic structures can predict the sandhi domains
so as to account for all TS patterns in Pingyao through restructuring, two
problems still remain in this OT analysis. The first problem is the property of
the alignment. Generally speaking, the term “alignment” refers to the corres-
pondence of different domains, that is, the correspondence between the mor-
phosyntactic category and the prosodic category. But if the Align-​L (i.e., the
sandhi domain; maximal prosodic word) adopted in the analyses considers
the domain of TS a prosodic unit, the alignment constraint here will have a
correspondence between prosodic units only, rather than between morphosyn-
tactic units and prosodic units. Another problem is the different TS behaviors
of the embedded disyllabic units in tri-syllables. Across the tri-syllabic
patterns in Pingyao, the embedded disyllabic unit behaves in different ways. In
some patterns it forms a prosodic word and undergoes TSB; in others it forms a
phonological phrase and undergoes TSA; in still others it forms no prosodic unit
at all, so that the choice of TS rule is decided by the property of the outer
maximal prosodic unit. This situation makes it difficult to define the embedded
disyllabic unit in tri-syllables as a consistent unit in the prosodic hierarchy.
Thus, the OT approach fails to capture the TS patterns in Pingyao other than by
brute force or ad hoc constraints.
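As a rough illustration of how the ranking [Align (TSD, ω’)L / Align (TSD,
φ’)R >> Parse σ >> Binary] selects a sandhi domain, the evaluation can be
sketched in Python. This is only a minimal sketch under simplifying assumptions
that are not part of the analysis itself: the candidate set is limited to the
four parses mentioned in the text; the flat parse (σσσ) is left out because a
tri-syllabic string is said to form two TSDs; only the alignment constraint
matching the category of the input is taken to be active; and the way violations
are counted, as well as every name in the code (CANDIDATES, optimal_parse, and
so on), is hypothetical.

```python
from typing import Dict, FrozenSet, Tuple

Span = Tuple[int, int]  # inclusive (first, last) syllable indices of one sandhi domain
N_SYLLABLES = 3

# Hypothetical candidate set: each parse is the set of sandhi domains it contains.
CANDIDATES: Dict[str, FrozenSet[Span]] = {
    "(σσ)σ": frozenset({(0, 1)}),
    "σ(σσ)": frozenset({(1, 2)}),
    "((σσ)σ)": frozenset({(0, 1), (0, 2)}),
    "(σ(σσ))": frozenset({(1, 2), (0, 2)}),
}


def align(domains: FrozenSet[Span], category: str) -> int:
    """Assumed reading of the alignment constraints: one violation per misaligned domain."""
    if category == "word":    # Align (TSD, ω')L: left edges must match
        return sum(1 for first, _ in domains if first != 0)
    if category == "phrase":  # Align (TSD, φ')R: right edges must match
        return sum(1 for _, last in domains if last != N_SYLLABLES - 1)
    raise ValueError(f"unknown category: {category}")


def parse_sigma(domains: FrozenSet[Span]) -> int:
    """Parse σ: one violation per syllable that belongs to no sandhi domain."""
    parsed = {i for first, last in domains for i in range(first, last + 1)}
    return N_SYLLABLES - len(parsed)


def binary(domains: FrozenSet[Span]) -> int:
    """Binary: one violation per domain that does not span exactly two syllables."""
    return sum(1 for first, last in domains if last - first + 1 != 2)


def violation_profile(domains: FrozenSet[Span], category: str) -> Tuple[int, int, int]:
    # Tuple order mirrors the ranking Align >> Parse σ >> Binary.
    return (align(domains, category), parse_sigma(domains), binary(domains))


def optimal_parse(category: str) -> str:
    # Strict ranking corresponds to lexicographic comparison of the profiles.
    return min(CANDIDATES, key=lambda name: violation_profile(CANDIDATES[name], category))


if __name__ == "__main__":
    for category in ("word", "phrase"):
        print(category, optimal_parse(category))
```

Under these assumptions, a prosodic-word input selects ((σσ)σ) and a
phonological-phrase input selects (σ(σσ)), matching the domains described in
the text. The sketch covers domain selection only and does not bear on the two
problems raised for the OT analysis.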

To sum up, with some of the tone sandhi data discussed here, I have shown
that Match Theory fails to make the correct prediction for the sandhi
phenomena in Xiamen Chinese and Pingyao Chinese. This is because the theories
within the OT framework (such as the Wrap-Align Theory and Match
Theory) are all derived from the Indirect Reference theory, but TS in some
Chinese dialects seems to support the Direct Reference theory (such as the
c/m-command condition).

10.6 Concluding remarks

Prosody is one of the core components of language and speech, indicating
information about syntax, turn-taking in conversation, and types of utterance,
such as questions or statements, as well as speakers’ attitudes and feelings.
Prosody plays an important role in human speech perception. However, the
interface between prosody and syntax/morphology has remained a controversial
area in prosodic studies. It is widely observed that phonological structure is
sensitive to syntactic structure, but which elements of phonological structure
are influenced by syntactic structure, and how, is still open to debate.
Research into Asian tonal languages, such as Chinese dialects, has played a
significant but often unappreciated role in uncovering the significance of
prosody. These studies not only deepen our understanding of Chinese prosody but
also present important data for theoretical inquiry and language typology. This
chapter, which has discussed some interface issues through the case studies of
Xiamen and Pingyao, has demonstrated that some theories (i.e., OT) fail to
capture the nature of TS in these two Chinese dialects, and that the interface
theory within the OT framework does not have explanatory power superior to that
of the theory proposed before the OT era.

1 Here tone shapes are symbolized by a numerical notation, where 5 equals the
highest and 1 equals the lowest on a 5-​point scale. The last two tones are restricted
to “checked” syllables, while the other five co-​occur with “free” syllables.
2 T stands for base tone, T’ for sandhi tone, and α for sandhi domain.
3 -p, -t, -k, and -q here stand for the checked syllable, and -q for the glottal ending.
4 For a detailed discussion on the distinction between VP-​adjunct and sentential
adjunct, see Tang (1990).
5 Here the symbol ‘#’ stands for the boundary between tone groups (TG), and
the TS rule is applied within TG but blocked across TG; the symbol ‘=’ is used
occasionally for highlighting the obligatory application of the TS rule at certain
junctions; and the letter ‘n’ for neutral tone.
6 In (32), where X* stands for zero or more occurrences of some maximal projection,
X is called a zero-​bar projection, X’ a single-​bar projection, and X” a double-​bar
(or maximal) projection.
7 This TG formation applies only to Xiamen TS above the phrasal level. The TS
of pronouns and grammatical markers is different because it belongs
to the clitic group (CG) TS.
8 Here tone shapes are symbolized by a letter notation (where L = low, M = middle,
H = high, LM = low rising, and so on), and -​q here stands for the glottal coda of
the checked syllable.
9 The ancient Chinese tonal system consists of four tonal categories, which under-
went a register split into yin and yang tonal registers in Middle Chinese due to
the loss of voice distinction. Yin tone comes from the originally voiceless onset
obstruent, while yang tone comes from the voiced obstruent. Originally, the tonal
value of yin tones is higher in pitch than that of yang tones, but later in some
Chinese dialects, yin tone and yang tone underwent register reversal.
10 In (44), the tone HM in TS position changes to the concave tone MLM. The first
half of MLM retains the high falling property of HM, and if we treat the rising
property at the end of the tone as details of phonetic implementation, then all the
TSA rules in (44) are progressive.

Chen, M. (1985) The syntax of Xiamen tone sandhi. MS, University of California
San Diego.
Chen, M. (1987) “The syntax of Xiamen tone sandhi”, Phonology Yearbook, 4, pp.
Chen, M. (1990) “What must phonology know about syntax?” in Inkelas, S., and Zec,
D. (eds.) The phonology–​syntax connection. Chicago: University of Chicago Press,
pp. 19–​46.
Chen, M. (1992) Argument vs. adjunct: Xiamen tone sandhi revisited. MS, University
of California San Diego.
Chen, M. (2000) Tone sandhi: Patterns across Chinese dialects. Cambridge: Cambridge
University Press.
Chen, M., and Zhang, H.-M. (1997) “Lexical and post-lexical tone sandhi in
Chongming” in Wang, J.-L., and Smith, N. (eds.) Studies in Chinese phonology,
vol. 1. Berlin: Mouton de Gruyter, pp. 13–52.
Cheng, R.-​L. (1968) “Tone sandhi in Taiwanese”, Linguistics, 41, pp. 19–​42.
Cheng, R.-​L. (1973) “Some notes on tone sandhi in Taiwanese”, Linguistics, 100,
pp. 5–​25.
Cheng, R.-​L. (1991) “Interaction, modularization, and lexical diffusion: Tone sandhi
in Taiwanese verbs”. Paper presented at the 3rd North America Conference on
Chinese Linguistics, Ithaca, NY.
Chomsky, N. (1981) Lectures on government and binding. Dordrecht: Foris.
Chomsky, N. (1986) Barriers. Cambridge, MA: MIT Press.
Chomsky, N. (1995) The minimalist program. Cambridge, MA: MIT Press.
Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York: Harper
and Row.
Chung, R.-​F. (1989) Aspects of Ke-​jia phonology. Ph.D. Diss., University of Illinois,
Duanmu, S. (1990) A formal study of syllable, tone, stress and domain in Chinese
Languages. Ph.D. Diss., Massachusetts Institute of Technology.
Ghini, M. (1993) Phonological phrase formation in Italian: A new proposal. MS,
University of Toronto.
Hou, J.-​Y. (1980) “Pingyao fangyan de liandu biandiao [Tone sandhi in Pingyao]”,
Fangyan [Dialect], 1, pp. 1–​14.
Hsiao, Y.-​C. (1991) Syntax, rhythm, and tone: A triangular relationship. Ph.D. Diss.,
University of California San Diego.
Hsu, H.-​C. (1992) “Domain of tone sandhi in idioms: A tug of war between the foot
formation rule and the tone group formation”. Paper presented at the 4th North
America Conference on Chinese Linguistics, Ann Arbor, Michigan.
Huang, C. T. James. (1982) Logical relations in Chinese and the theory of grammar.
Ph.D. Diss., Massachusetts Institute of Technology.
Huang, C. T. James. (1990) “Reconstruction, the A/​A’ distinction, and the structure of
VP”. Paper presented at the Second Northeast Conference on Chinese Linguistics,
Philadelphia, PA.
Huang, C. T. James. (1991) “Verb movement, (in)definiteness, and the thematic
hierarchy” in Paul, J.-​K. Li et al. (eds.) Proceedings of the Second International
Symposium on Chinese Languages and Linguistics, vol. 2. Taiwan, pp. 481–​498.
Ito, J., and Mester, A. (2012) “Recursive prosodic phrasing in Japanese” in Borowsky,
T., Kawahara, S., Shinya, T., and Sugahara, M. (eds.), Prosody matters: Essays in
honor of Elisabeth Selkirk. London: Equinox, pp. 280–​303.
Kaisse, E. M. (1985) Connected speech: The interaction of syntax and phonology.
New York, San Diego: Academic Press.
Kaisse, E., and Zwicky, A. M. (1987) “Syntactic influences on phonological rules”,
Phonology Yearbook, 4, pp. 3–​11.
Lin, J.-​W. (1994) “Lexical government and tone group formation in Xiamen Chinese”,
Phonology, 11, pp. 237–​275.
McCarthy, J., and Prince, A. (1990) “Prosodic morphology and templatic morph-
ology” in Eid, M., and McCarthy, J. (eds.) Perspectives on Arabic linguistics: Papers
from the Second Symposium. Amsterdam: John Benjamins, pp. 1–​54.
McCarthy, J., and Prince, A. (1993) “Generalized alignment” in Booij, G., and van
Marle, J. (eds.) Yearbook of morphology. Dordrecht: Kluwer, pp. 79–​153.
Nespor, M., and Vogel, I. (1986) Prosodic phonology. Dordrecht: Foris.
Nespor, M., and Vogel, I. (2007) Prosodic phonology: With a new foreword.
Berlin: Mouton de Gruyter.
Prieto, P. (2005) “Syntactic and eurhythmic constraints on phrasing decisions”, Studia
Linguistica, 59, pp. 194–​222.
Prieto, P. (2006) “Phonological phrasing in Spanish” in Colina, S., and
Martinez-Gil, F. (eds.) Optimality-theoretic advances in Spanish phonology.
Amsterdam; Philadelphia: John Benjamins, pp. 39–60.
Prince, A., and Smolensky, P. (1993) Optimality Theory: Constraint interaction in gen-
erative grammar. Cambridge, MA: MIT Press.
Poser, W. (1981) Some topics in non-​linear phonology. MS, Massachusetts Institute
of Technology.
Poser, W. (1985) “There is no domain size parameter”, Glow Newsletter, 14, pp. 66–​67.
Reinhart, T. (1976) The syntactic domain of anaphora. Ph.D. Diss., Massachusetts
Institute of Technology.
Reinhart, T. (1981) “Definite NP anaphora and C-​command”, Linguistic Inquiry, 12,
pp. 605–635.
Sandalo, F., and Truckenbrodt, H. (2002) “Some notes on phonological phrasing in
Brazilian Portuguese”, MIT Working Papers in Linguistics, 42, pp. 285–​310.
Selkirk, E. (1984) Phonology and syntax: The relation between sound and structure.
Cambridge, MA: MIT Press.
Selkirk, E. (1986) “On derived domains in sentence phonology”, Phonology Yearbook,
3, pp. 371–405.
Selkirk, E. (1995) “Sentence prosody: Intonation, stress, and phrasing” in Goldsmith,
J. A. (ed.) The handbook of phonological theory. Cambridge, MA; Oxford,
UK: Blackwell, pp. 550–​569.
Selkirk, E. (1996) “The prosodic structure of function words” in Morgan, J. L., and
Demuth, K. (eds.) Signal to syntax: Bootstrapping from speech to grammar in early
acquisition. Mahwah, NJ: Lawrence Erlbaum Associates, pp. 187–​214.
Selkirk, E. (2000) “The interaction of constraints on prosodic phrasing” in Horne, M.
(ed.) Prosody: Theory and experiments. Dordrecht: Kluwer, pp. 231–​262.
Selkirk, E. (2006) “Strong minimalist spell-out of prosodic phrases.” Paper
presented at GLOW Workshop on Prosodic Phrasing, Universitat Autònoma
Selkirk, E. (2009) On clause and intonational phrase in Japanese: The syntactic grounding
of prosodic constituent structure. Gengo Kenkyu.
Selkirk, E. (2011) “The syntax-​phonology interface” in Goldsmith, J., Riggle, J., and
Yu, A. (eds.) The handbook of phonological theory. Oxford: Blackwell, pp. 435–​484.
Shen, Y. (1988) A tentative hypothesis regarding tri-​syllabic tone sandhi in Pingyao.
MS, University of California San Diego.
Shih, C.-​L. (1986) The prosodic domain of tone sandhi in Chinese. Ph.D. Diss.,
University of California San Diego.
Steriade, D. (1987) “Locality conditions and feature geometry”, NELS, 17, pp.
Tang, J. (1990) Chinese phrase structure and the extended X’-​theory. Ph.D. Diss.,
Cornell University.
Truckenbrodt, H. (1995) Phonological phrases: Their relation to syntax, focus, and
prominence. Ph.D. Diss., Massachusetts Institute of Technology.
Truckenbrodt, H. (1999) “On the relation between syntactic phrases and phonological
phrases”, Linguistic Inquiry, 30, pp. 219–256.
Zhang, H.-​M. (1992) Topics in Chinese phrasal tonology. Ph.D. Diss., University of
California San Diego.
Zhang, H.-​M. (2008a) “C-​command approach to tone sandhi in Chinese dialects”,
Dialect, 4, pp. 289–​303.
Zhang, H.-​M. (2008b) “Phrasal phonology and Chinese tone sandhi” in Feng, S., and
Shen, Y. (eds.) Linguistics theory and Chinese studies. Beijing: Commercial Press,
pp. 521–​535.
Zhang, H.-​M. (2008c) “Labial-​labial co-​occurrence constraint in Cantonese”, Revista
da Ciencia Linguistica de Macau, 31/​32, pp. 46–​56.
Zhang, H.-​M. (2014) “Yunlü yinxixue yu hanyu yunlü yanjiuzhong de ruogan wenti”
[Some issues on prosodic phonology and Chinese prosodic studies], Dangdai
Yuyanxue [Contemporary linguistics], 16(3), pp. 303–​327.
Zhang, H.-​M. (2017) Syntax-​phonology interface: Argumentation from tone sandhi in
Chinese dialects. London; New York: Routledge.
Zhang, H.-​M., and Chen, M. Y. (1995) “Morphosyntactic diffusion hypothesis” in
Zee, E. (ed.) New Asia Academic Bulletin, vol. 11: Studies of the Wu dialects. Hong
Kong: Chinese University Press, pp. 69–89.
Zhang, H.-​M., and Yin, Y.-​X. (2012) “Youxuanlun de shiyufei xiandai yinxixue yanjiu
de ruogan fansi” [Pros and cons of Optimality Theory: Some thoughts on phono-
logical issues], Zhongguo Yuwen [Studies of the Chinese language], 6, pp. 483–​499.
Zhang, J. (1999) “Duration in the tonal phonology of Pingyao Chinese”, UCLA
Working Papers in Linguistics, 3, pp. 147–​206.
Zwicky, A., and Kaisse, E. (eds.) (1987) “Syntactic conditions on phonological rules”,
Phonology Yearbook, 4(1), pp. 3–​11.

Part IV

Prosody in language acquisition


Perceptual development of phonetic
categories in early infancy
Consonants, vowels, and lexical tones
Jun Gao and Rushen Shi

11.1 Introductory remark

One fundamental issue in language acquisition research concerns input-​
guided learning versus input-​independent capacities in children. On the one
hand, researchers strive to understand the way native language input shapes
acquisition. On the other hand, there is a strong interest in determining how
language acquisition may be affected by children’s natural capacities (inde-
pendent of the specific ambient language) such as those present at birth. These
questions apply to various levels of linguistic representations such as syntax,
phonology and phonetics. In this chapter, we discuss key empirical findings in
early phonetic development that shed light on these questions, and we report
our recent experiments on infants’ perception of lexical tones during the first
year of life.

11.2 Perceptual development of consonants and vowels in infants

Research in phonetic and phonological acquisition has contributed valu-
able empirical results on the effect of input versus children’s natural cap-
acities. Perceptual studies with neonates and infants during the first year of
life are directly pertinent. Most studies have concentrated on the perceptual
development of native and non-​native consonants and vowels. It has been
demonstrated that infants are born with the natural capacity to perceive many
phonetic contrasts, both native and non-​native ones, and that their perception
is gradually influenced by the sound structure of the native language during
the course of the first year of life. This was shown in the classic work of Werker
and colleagues (Werker et al. 1981; Werker and Tees 1984). They presented
participants with consonantal contrasts in Hindi and Salish (including a
Hindi retroflex-​dental contrast and a Salish velar-​uvular contrast, both absent
in English). They found that six- to eight-month-old English-learning infants
discriminated the non-English contrasts, but their discrimination declined by
ten to twelve months of age. Adult English speakers also failed to discriminate
the contrasts. Hindi- and Salish-learning infants, however, maintained
their discrimination of their respective native contrast at ten to twelve months
of age. A similar pattern of perceptual development was found for vowels. In
Polka and Werker (1994), English-​learning infants discriminated a German
front rounded versus back rounded vowel contrast at four months of age, and
the discrimination deteriorated after six months of age.
These findings suggest that infants begin acquisition with the language-​
general ability to perceive phonetic contrasts, and that as they begin to
acquire the phonological system of the ambient language, native contrasts
are maintained and the non-​native ones become attenuated in perception
and representation. Indeed, during the second half of the first year of life,
infants begin to learn various aspects of their native phonology. Infants start
representing the internal structure of native vowels, such that they respond
differently to prototypical and non-​prototypical tokens of a native vowel cat-
egory (Kuhl 1991; Kuhl et al. 1992). Between six and nine months of age
infants develop sensitivity to the phonotactic regularities (e.g., Mattys and
Jusczyk 2001) and stress patterns (e.g., Jusczyk, Cutler, and Redanz 1993) of
their native language. The narrowing of perception of phonetic contrasts is
consistent with infants’ focus on the native language phonological structures,
and the experience with the ambient speech input thus influences the evolving
perceptual patterns for native versus non-​native sounds.
Later studies revealed a more complex picture of perceptual develop-
ment of consonantal and vowel contrasts in infants. Research showed that
the discrimination of native contrasts is not always about the maintenance
of discrimination from early infancy. For certain contrasts, there is gradual
improvement over age in infants’ discrimination of native contrasts. Kuhl
and colleagues (2006) found that English-learning infants can discriminate
the English /r/-/l/ contrast at six to eight months of age, and importantly,
they improve significantly in their discrimination of this contrast between
six and twelve months of age. Likewise, Mandarin-Chinese-learning infants’
discrimination of a Mandarin-Chinese affricate-fricative contrast improves
from six to twelve months of age (Tsao, Liu, and Kuhl 2006). Japanese-learning
infants in Kuhl et al. (2006) and English-learning infants in Tsao,
Liu, and Kuhl (2006) declined in their discrimination of those non-native
contrasts during the same age period. Thus, while the lack of input leads
to perceptual decline in non-native infants, continued input exposure leads
to discrimination improvement in native-language infants. The ability
to discriminate certain contrasts is not fully in place at birth. Facilitative
learning occurs during the first year of life as infants gain experience with
the native-​language input.
Further variability has been observed with respect to listeners’ natural
capacities and the effect of input for phonetic perception. For example, the
discrimination level of the English voiced stop versus fricative d-th (/d/–/ð/)
distinction stays unchanged and is equivalent for both English-​learning and
French-​learning infants throughout the first year of life (even though the con-
trast is present in English but absent in French), and significant improvement
was observed from age one to adulthood in English listeners only (Polka,
Colantonio, and Sundara 2001), indicating a delayed effect of input on
learning. There are also cases in which the discrimination of certain speech
sounds is absent at birth, and infants must rely entirely on phonetic learning
from the input (Narayan, Werker, and Beddor 2010). In Narayan et al.
(2010), neither Filipino- nor English-learning infants discriminated the
acoustically similar Filipino syllable-initial alveolar and velar nasals
during early infancy. Filipino infants eventually learned to discriminate
this contrast by ten to twelve months of age, whereas English-learning
infants across ages consistently failed to make the discrimination. Narayan
et al. (2010) interpreted their findings in terms of acoustic salience. That is, the
acoustic cues to the contrasting nasal consonants are too weak. In this sense,
the innate language-​general perceptual ability shown in previous studies (e.g.,
Werker and Tees 1984; Polka and Werker 1994) appears to require certain
basic acoustic saliency.
On the other hand, certain non-​native contrasts are well discriminated
from early infancy to adulthood despite missing experience, as in the
case of English infants’ and adults’ discrimination of Zulu clicks (Best,
McRobert, and Sithole 1988). Based on the perceptual assimilation model
(PAM; Best, 1995), the discrimination of non-​native contrasts is related
to whether the sounds are assimilable to native phonetic categories and
how they are assimilated to the native categories. According to this model,
Zulu clicks remain discriminable to English listeners because they are non-​
assimilable to any English phonemic categories. It is possible that the clicks
were perceived as non-​speech sounds by the English listeners, and that the
general auditory system was sensitive to their acoustic differences in a non-​
categorical fashion.
Variable results have also been reported for vowels. Polka and Bohn
(1996) found that English and German adults discriminated both an
English vowel contrast (dat-det) and a German vowel contrast (d/u/t-d/y/t),
even though each contrast is absent from the other group’s native
language. Furthermore, six- to eight-month-old and ten- to twelve-month-old
English- and German-learning infants showed comparable discrimination
of these native and non-native contrasts, and there was no difference
in performance across those ages. Therefore, whereas the early discrim-
ination reflects infants’ language-​general natural perceptual ability, the
basis for the continued discrimination of those non-​native vowel contrasts
during later infancy and adulthood is unclear. It may be a manifestation of
the innate natural perceptual capacities that persist. Alternatively, it may
be due to listeners’ perceptual assimilation of those non-​native contrasts
to their nearest native vowel contrasts, consistent with the view of PAM
(Best 1995).
In sum, research on perceptual development of consonants and vowels
revealed evidence supporting both input-​independent natural perceptual cap-
acities and input-​guided phonetic learning. Both mechanisms exert effects
during the course of acquisition and are contrast-​dependent.

11.3 Perceptual development of lexical tones in infants

Phonemic inventories in natural languages include not only consonants and
vowels but also suprasegmental categories such as lexical tones. Many of the
world’s languages (for example, in Asia) use tonal contrasts to distinguish
word meaning. For example, in Mandarin ma1 and ma3 are minimal pairs
contrasting in tones (Tone 1: high-​level versus Tone 3: low-​dipping) and
denote different meanings (ma1 “mother” versus ma3 “horse”). The typical
acoustic correlate for tones is the fundamental frequency (i.e., pitch) of the
tone-​bearing unit (usually the vowel or the syllable), although other acoustic
properties such as the duration and amplitude of the tone-​bearing unit may
also cue tonal distinctions.
Relative to the abundant literature on early perceptual development of
consonants and vowels, fewer studies have investigated infants’ perception
of lexical tones. The study of lexical tones is relevant for the issue of input-​
driven learning versus input-​independent natural capacities in the acquisition
of phonetic categories, as it is interesting to know if the acquisition of lexical
tones is governed by the same mechanisms as those that underlie the acquisi-
tion of consonants and vowels.
The published studies so far have yielded variable results. Mattock and
colleagues (Mattock and Burnham 2006; Mattock et al. 2008) reported
a similar developmental trajectory in infants’ perception of lexical tones
as shown for consonants (Werker and Tees 1984) and vowels (Polka and
Werker 1994). In their experiments, English- and French-learning infants
discriminated the Thai low-level versus rise contrast at four and six months of
age, but failed to do so at nine months of age, suggesting that infants were
universal listeners of lexical tones early in life, and that the lack of tonal
contrasts in their native languages led to the decline of tonal discrimination
at nine months of age. Cantonese- and Mandarin-learning infants continued to
discriminate the Thai tonal contrast at nine months of age, presumably because
their native languages, which contain a similar tonal contrast, influenced their
discrimination of those Thai tones.
Yeung, Chen, and Werker (2013) examined infants’ perception of a contrast
in Cantonese that is similar to the contrast in Mattock et al. (2008),
mid-level versus rise tones. They compared the performance of non-tone-learning
(English), non-native tone-learning (Mandarin), and native-tone-learning
(Cantonese) infants. They found evidence of discrimination in four-month-olds
of all three language groups, suggesting that infants responded as universal
listeners of lexical tones. At nine months, the English-learning infants
no longer discriminated the Cantonese tones, whereas the two Chinese groups
continued to show evidence of discrimination. Their results, however, are
difficult to interpret –​the three groups of infants did not always yield the
predicted pattern of responses. In particular, a preference for alternating
trials (i.e., both tones presented within a trial) over non-​alternating trials (the
level tone in some trials and the rise tone in other trials) was predicted for
successful tonal discrimination. In some of their experimental conditions,
infants preferred the alternating trials over trials presenting one of the tones,
but not over trials presenting the other tone. For example, Mandarin-​learning
infants looked longer in alternating trials than in the level-tone trials only;
their looking times in the rise-tone trials and alternating-tone trials were
comparably high. The English-learning four-month-olds preferred the alternating
trials to the rise-​tone trials, but their responses to the mid-​level-​tone trials and
alternating-​tone trials were similar. Furthermore, among the Chinese infants,
one of the familiarization sub-​groups (i.e., the group familiarized with the
mid-​level tone) did not show any discrimination during the test phase. These
results seem puzzling. Nevertheless, the overall decline in English-​learning
infants’ discrimination from four to nine months of age is consistent with the
results of Mattock et al. (2008).
There is also evidence that non-​tone-​learning infants’ discrimination of cer-
tain lexical tones persists throughout the first year of life despite no experience
with tones. Liu and Kager (2014) examined the perception of the Tone 1 (high-​
level) and Tone 4 (fall) contrast in Mandarin in Dutch-​learning infants aged five
to eighteen months. Infants across the age range all successfully discriminated
the contrast. Their responses were phonetic rather than phonological, since
Dutch does not have lexical tones. In Shi, Santos, Gao and Li (2017), 4-, 8-, and
11-month-old infants whose native language was French, a non-tonal language,
also showed no decline in discriminating Tone 1 and Tone 4. Similarly,
18-month-old English-learning infants discriminated Mandarin Tone 2 (rise) and
Tone 4 (fall) in a word-learning task involving tonal mispronunciations (Singh
et al. 2014). Even non-tone-speaking adults show some ability to perceive
certain tonal contrasts in Mandarin (So and Best 2010). Acoustic salience may
be a factor accounting for non-tone-learning infants’ sustained discrimination
of these contrasts. The 4- to 11-month-old French-learning infants in Shi,
Santos, Gao and Li (2017) showed a tendency to decline over age in their dis-
crimination of the more similar Tone 2 –​Tone 3 contrast. Consistent with this
idea, when Liu and Kager (2014) artificially reduced the pitch differences of
their naturally produced stimuli (Tone 1 and Tone 4), infants showed a decline
in discriminating the tones from eight to fifteen months of age.
Tsao (2008) tested the role of acoustic salience in the discrimination of
lexical tones in Mandarin-learning ten- to twelve-month-old infants, using three
tonal contrasts in Mandarin: Tone 1 (high-level) – Tone 3 (low-dipping), Tone
2 (rise) – Tone 4 (fall), and Tone 2 – Tone 3. Infants discriminated the
acoustically most distinct contrast, Tone 1 – Tone 3, significantly better than
the other two contrasts (Tone 2 – Tone 3 and Tone 2 – Tone 4), which did not
differ from each other in discrimination. In another study, however,
Mandarin-learning infants aged eight to eleven months categorically discriminated Tone
2 and Tone 4 even when the tones were embedded in variable tonal contexts
(Shi 2009).
Taken together, the perception of lexical tones by non-​tone-​learning infants
is contrast dependent, with some contrasts showing the language-universal
to language-specific developmental trajectory (as with certain consonants
and vowels), but with some other tones remaining discriminable throughout
infancy despite lack of relevant experience. The development of
native-tone-learning infants is little understood. Among the few existing studies, Yeung,
Chen, and Werker’s (2013) results were mixed and inconclusive, and Tsao (2008)
tested only infants aged ten to twelve months, not younger ones. In addition,
the Conditioned Headturn Procedure in Tsao (2008) involved training the
infants on the tonal contrasts that were subsequently tested; thus, infants’
spontaneous discrimination of the tones remains unclear. In the next section
we report our experiment on the perceptual development of native tones
during the first year of life.

11.4 The experiment

To better understand the effect of natural perceptual capacity versus input-​
driven learning in the early development of native lexical tones, we examined
Mandarin-​learning infants’ perception of Mandarin tones from four to thir-
teen months of age. We used a habituation procedure that tested infants’
spontaneous responses to different tones without any training. Two tonal
contrasts, Tone 2 –​Tone 3 and Tone 1 –​Tone 4, were tested, allowing us
to examine whether there were contrast-​dependent effects in infants’ tonal
perception. Furthermore, we used multiple exemplars for each tone, and
crucially, the exemplars for the habituated tone during the test phase were
different from those during habituation. This aspect differed from Tsao (2008)
and Yeung et al. (2013), in which the same exemplars were used throughout
training/​familiarization and the test phases. The change of exemplars across
experimental phases for the same tone ensured that our task definitively tested
infants’ generalized knowledge about tonal categories beyond the memoriza-
tion of specific exemplars heard during habituation.
In Mandarin-​Chinese, there are four lexical tones, high-​level (Tone 1), rise
(Tone 2), low-​dipping (Tone 3), and fall (Tone 4). Tone 2 and Tone 3 are gen-
erally considered acoustically similar, as they are both contour tones starting
from the mid-​part of the pitch range and ending higher in the pitch trajectory,
although their trajectories differ. They are also different in terms of mode of
phonation: Tone 3 is often produced with creaky voice. It is unknown if this
characteristic plays a role in infants’ tonal discrimination. The tones of the
other contrast that we tested, Tone 1 –​Tone 4, shared the same pitch height at
the tonal onset, and their pitch trajectories diverge, with Tone 1 staying high
and Tone 4 moving downward. Tone 1 is typically longer than Tone 4. The
two tones thus seem to have salient acoustic differences, and they were dis-
criminable to both infants and toddlers whose native language contains no
contrastive tones (Liu and Kager 2014; Shi, Santos, Gao and Li 2017). Tsao
(2008) showed that Tone 2 and Tone 3 were more difficult to discriminate
than Tone 1 versus Tone 3 for Mandarin-​learning one-​year-​olds, but the Tone
2 – Tone 3 contrast was not more difficult than the Tone 2 – Tone 4 contrast.
It is unknown where the Tone 1 – Tone 4 contrast stands relative to the
other contrasts in native-tone-learning infants’ discrimination. In So and Best
(2010), the Tone 2 – Tone 3 and Tone 1 – Tone 4 contrasts were comparably
confusable to English-speaking adults, whereas the Tone 1 – Tone 3 contrast
was better perceived. According to So and Best (2010), the Tone 2 – Tone
3 and Tone 1 –​Tone 4 contrasts were comparable in their perceptual sali-
ence because both contrasts contain tones that share pitch features (e.g., pitch
height at onset/​offset, pitch contour, etc.). However, phonologists consider
contour tones as generally more complex than level tones (Yip 2002). This
may mean that Tone 2 and Tone 3 are more difficult for discrimination than
Tone 1 versus Tone 4, since the former contrast involves two contour tones,
whereas the latter contains one level tone and one contour tone. There is no
consensus regarding what determines perceptual salience, and the answer
requires more experimental work.
In our experiment we examined how Mandarin-​learning infants’ percep-
tion of Tone 1 –​Tone 4 and Tone 2 –​Tone 3 evolves during the first year
of life. In a prior study (Shi, Gao, Achim and Li 2017) we had tested the
discrimination of Tone 2 versus Tone 3 in a group of Mandarin-​learning 4-​
to 13-​month-​old infants. Here we again tested this tonal contrast with two
different age groups. The two particular contrasts (Tone 2 –​Tone 3; Tone 1 –​
Tone 4) have been shown to be perceptually more confusable than other
contrasts in previous studies; we thus chose them to test whether experience
with the native language during the first year of life can yield improvement in
infants’ discrimination of the tones. We expected the Tone 2 – Tone 3 contrast
to be relatively harder due to its lower acoustic salience (Tsao 2008), higher
phonetic feature similarity (So and Best 2010), and greater phonological com-
plexity (Yip 2002).
Stimuli. Two lexical tone contrasts, Tone 2 (rising) – Tone 3 (low-dipping) and
Tone 1 (high-level) – Tone 4 (falling), were used for our experiment. The Tone
2 – Tone 3 stimuli were the same as those in Shi, Gao, Achim and Li (2017).
The tone-bearing syllable (given in pinyin) was can for the T2-T3 contrast
and kui for the T1-T4 contrast. The reason for choosing these syllables
was that the morphemes represented by these syllables with the four tones are
all unfamiliar to infants and young children, thus controlling for the factor
of meaning. A Mandarin-​Chinese-​speaking female produced the stimuli in
the infant-​directed speech style in an acoustic chamber. During recording,
she produced multiple exemplars of the syllables with all four tones, which
ensured that the relative tone height and contours for the tones fell within the
natural pitch range of the speaker. The stimuli were recorded with a 22 kHz
sampling frequency, 16-​bit resolution. The final selected stimuli for each target
tone consisted of 13 tokens. The mean duration of the T2 tokens was 718 ms
(max = 806 ms, min = 631 ms) with a standard deviation of 63 ms. The
mean duration of the T3 tokens was 717 ms (max = 802 ms, min = 630 ms)
with a standard deviation of 63 ms. An independent-samples t-test showed that
the durations of the tokens of the two tones did not differ, t(24) = 0.47,
p = 0.963. For T3, ten out of thirteen tokens had creaky voice, with six of the
ten creaky tokens used for habituation and four used for test. The mean
duration of the T1 tokens was 585 ms (max = 645 ms, min = 503 ms) with a
standard deviation of 35 ms. The mean duration of the T4 tokens was 494 ms
(max = 536 ms, min = 464 ms) with a standard deviation of 22 ms. An
independent-samples t-test showed that the tokens of T1 were significantly
longer than those of T4, t(24) = 7.861, p < 0.001. All tokens were adjusted to
comparable amplitude using Cool Edit Pro 2.0. Figure 11.1 shows example tokens of the two
contrasts. In addition, we designed a visual stimulus, a colorful checkerboard-​
like geometrical image, which was presented along with the speech stimuli
during the experiment.
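
The duration comparisons above can be approximated from the reported summary statistics alone. The following is a minimal sketch, assuming a pooled-variance (Student) independent-samples t-test with 13 tokens per tone; the function name pooled_t is hypothetical, and because the inputs are rounded means and standard deviations, the recomputed values need not match the reported statistics exactly.

```python
import math

def pooled_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Student's independent-samples t statistic from summary statistics.

    Assumes equal population variances (pooled estimate); this is an
    assumption, since the chapter does not state which variant was used.
    """
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / standard_error, df

# Rounded duration statistics (ms) as reported in the Stimuli paragraph.
t_t2_t3, df_t2_t3 = pooled_t(718, 63, 13, 717, 63, 13)
t_t1_t4, df_t1_t4 = pooled_t(585, 35, 13, 494, 22, 13)

print(f"T2 vs T3: t({df_t2_t3}) = {t_t2_t3:.3f}")
print(f"T1 vs T4: t({df_t1_t4}) = {t_t1_t4:.3f}")
```

Both comparisons have 13 + 13 - 2 = 24 degrees of freedom, which agrees with the t(24) values reported above.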
Participants. Participants were a total of 62 monolingual Mandarin-​Chinese-​
learning infants who resided in Beijing and heard standard Mandarin at
home. Infants formed four groups defined by tonal contrast and age: T2-​
T3 younger group (n = 16, Mean: 6 months 5 days, Age Range: 5 months
26 days –​6 months 29 days); T2-​T3 older group (n = 14, Mean: 8 months
22 days, Age Range: 7 months 10 days–​11 months 0 days); T1-​T4 younger
group (n = 16, Mean: 5 months 18 days, Age Range: 4 months 15 days –​
6 months 25 days); and T1-T4 older group (n = 16, Mean: 11 months 20 days,
Age Range: 9 months 19 days –​13 months 6 days).



Figure 11.1 (a) Pitch trajectories of example stimuli of Tone 2 and Tone 3. The broken
part in the mid-section of the Tone 3 pitch curve stands for creaky voice.
(b) Pitch trajectories of example stimuli of Tone 1 and Tone 4.
(Vertical axis in both panels: F0 in Hz.)

Apparatus. The experiment was conducted in a quiet room, where the infant
sat on the mother’s lap facing a computer screen. Loudspeakers were placed
on both sides of the screen and played auditory stimuli simultaneously. The
display screen and the loudspeakers were connected to a computer in the con-
trol room. Under the screen a camera transmitted the video of the infant to
a computer in the control room. Blind to the stimuli of the experiment, the
experimenter in the control room outside the testing room operated the com-
puter to run the experiment program and coded online the infant’s looking
to and away from the screen. The experiment program was pre-​set to pre-
sent the audio and visual stimuli contingent upon the infant’s looking to the
screen. The program also recorded the looking-​time data automatically and
performed the habituation calculation online. During the experiment, the
mother listened to masking music through headphones (Peltor HTM79A).
She was asked not to interact with, interrupt, or influence the infant.
Procedure. The habituation paradigm was adopted. Each infant was
habituated with one of the two tones in a contrast. Seven tokens of the tone
were presented randomly and repeatedly across trials during the habituation
phase. The experimental program recorded online the looking time of each
trial. The looking time of each sliding window of three consecutive trials
was compared online with the looking time of the first window of three
trials. The habituation criterion was reached if the looking time in a later
window declined to 50 percent or lower of the first window of trials, and
the experiment proceeded into the test phase automatically. In the test phase
there were two trial types, Same and Different. The Same type presented
six novel tokens of the same tone that had been presented in the habitu-
ation phase. The Different type presented six tokens of the contrasting tone.
That is, the test stimuli were all new, with the Same exemplars belonging
to the habituated tonal category, and the Different exemplars belonging to
the other tone that had never appeared during habituation. For both the
habituation and test phases, the inter-stimulus interval (ISI) within a trial
was 1000 ms, and the maximum trial length was 21 s. Each trial was initiated
upon the infant’s looking, and was terminated if he or she looked away for
more than two seconds or if the maximum trial length was reached. When a
trial stopped, an attention-getter, an animation of a jumping star, popped up
automatically to attract the infant’s attention back to the screen. The visual
stimulus, a colorful checkerboard-like geometrical image, occurred
simultaneously with the speech stimuli during each trial. In addition, a
pre-trial and a post-trial were presented at the beginning and the end of the
experiment. These trials presented a zooming picture of a cat. During the
pre-trial, the cat image was accompanied by the following speech: Zhe shi
shenme? (“What’s this?”) Mao (“cat”), mao, mao; zhe shi mao (“this is a cat”),
mao, mao; yi zhi mao (“a cat”), mao, mao. During the post-trial, the auditory
stimulus was only the word mao, which was presented repeatedly. The pre-trial
served to acquaint the infant with the equipment. The post-trial helped
us judge whether infants were still on task toward the end of the experiment,
as looking time should increase in the post-trial because the stimuli were
distinct from those in the preceding trials.
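
The habituation criterion and the trial-termination rule described in the Procedure can be sketched as follows. This is an illustrative reconstruction, not the experiment program itself: the function names are hypothetical, the window is assumed to slide forward one trial at a time, and looking time per window is assumed to be the sum over its three trials.

```python
WINDOW = 3              # trials per window (from the Procedure)
CRITERION = 0.5         # later window must fall to 50 percent or lower
MAX_TRIAL_S = 21.0      # maximum trial length in seconds
MAX_LOOK_AWAY_S = 2.0   # look-away duration that ends a trial

def habituation_trial(looking_times, window=WINDOW, criterion=CRITERION):
    """Return the 1-based trial at which the criterion is first met, else None.

    The summed looking time of each later window of `window` consecutive
    trials is compared with that of the first window (assumed procedure).
    """
    if len(looking_times) <= window:
        return None
    baseline = sum(looking_times[:window])
    for end in range(window + 1, len(looking_times) + 1):
        current = sum(looking_times[end - window:end])
        if current <= criterion * baseline:
            return end
    return None

def trial_should_end(elapsed_s, look_away_s):
    """A trial ends on a look-away longer than 2 s or at the 21 s maximum."""
    return look_away_s > MAX_LOOK_AWAY_S or elapsed_s >= MAX_TRIAL_S

# Hypothetical looking times (seconds) for one infant, for illustration only.
example_times = [20, 18, 16, 12, 9, 6, 5, 4]
print(habituation_trial(example_times))
```

In this hypothetical sequence the first window sums to 54 s, so the test phase would begin once a later three-trial window sums to 27 s or less.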
Design. Infants were divided into two main groups, one for the T2-​T3 contrast,
and the other for the T1-​T4 contrast. Within each contrast group, half of the
infants were habituated with one tone, and the other half with the other tone.
All of them then heard new exemplars of both tones in different test trials.
Same and Different test trials were relative to the particular habituation tone.
For example, for the T2 habituation infants, T2 was the Same test trial type, and
T3 the Different type. The reverse was the case for the T3 habituation infants,
with T3 being the Same and T2 being the Different test trials. The first test trial
was either the Same type or Different type, counterbalanced across infants.
The looking time during the test trials was the dependent variable.
The rationale of the habituation paradigm was that once infants became
habituated with one tone, they should show renewed interest upon hearing
a different tone in the test phase if they could discriminate the tones. In our
design, the exemplars for both the Same and Different test tones were novel.
We therefore predicted that if infants could categorize the tones of a con-
trast, they should look significantly longer in the Different than in the Same
test trials, even though all test stimuli were novel. If infants could not cat-
egorize the contrasting tones, looking time to the Same and Different test
trials should not differ.
Results. Looking time during the test trials was analyzed in a 2 × 2 × 2 mixed
ANOVA, with Trial Type (Same, Different) as the within-subject factor, and Age
(younger, older) and Contrast (T2-T3, T1-T4) as the between-subject factors.
The results showed a significant main effect of Trial Type, F(1, 58) = 7.519,
p = 0.008. There was no effect of Age, F(1, 58) = 0.62, p = 0.434, and no
effect of Contrast, F(1, 58) = 1.13, p = 0.292. Furthermore, none of the
interactions was significant. Figure 11.2 shows that looking times were
significantly longer in Different than in Same trials for each contrast.
Because infant looking behavior can be quite variable, with some infants being
overall long-lookers and others overall short-lookers, we log-transformed the
raw looking times so as to reduce such variability. The same ANOVA was also
conducted on the log-transformed data. The result pattern was identical to
that of the raw data, with a significant main effect of Trial Type,
F(1, 58) = 12.999, p = 0.001, and no other significant main effect or
interaction. These results indicate that Mandarin-learning infants
perceived the two tonal contrasts in Mandarin successfully at both early and
later stages of the first year of life. That is, T2-T3 and T1-T4 were equally
perceptible to infants from four to thirteen months of age. T2-T3, a contrast
generally considered the most acoustically similar among Mandarin tones,
did not show a different pattern of development from T1-T4.

Figure 11.2 Results of both younger and older Mandarin-learning infants for the
Tone 2 – Tone 3 (left two columns) and for Tone 1 – Tone 4 (right two
columns) contrasts. Looking times (means and standard errors) were
significantly longer in Different than in Same test trials.
(Chart title: Perception of T2–T3 and T1–T4 contrasts. Vertical axis: looking
time in seconds.)
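
Two details of the analysis above can be illustrated with a short sketch: where the 58 error degrees of freedom come from, and why a log transform reduces the influence of overall long-lookers. The group sizes are those given under Participants; the two infants and their looking times are hypothetical and serve only to show the arithmetic.

```python
import math

# Group sizes from the Participants paragraph (Contrast x Age cells).
group_sizes = {
    ("T2-T3", "younger"): 16,
    ("T2-T3", "older"): 14,
    ("T1-T4", "younger"): 16,
    ("T1-T4", "older"): 16,
}
n_total = sum(group_sizes.values())
# With one two-level within-subject factor, the error df for every effect
# is the number of infants minus the number of between-subject cells.
df_error = n_total - len(group_sizes)

def log_difference(same_s, different_s):
    """Different-minus-Same looking time on a natural-log scale."""
    return math.log(different_s) - math.log(same_s)

# Hypothetical infants: both look 1.5 times longer in Different trials.
infants = {"short_looker": (2.0, 3.0), "long_looker": (10.0, 15.0)}
for name, (same_s, different_s) in infants.items():
    raw = different_s - same_s
    logged = log_difference(same_s, different_s)
    print(f"{name}: raw difference = {raw:.1f} s, log difference = {logged:.3f}")
```

On the raw scale the long-looker contributes a difference five times larger than the short-looker, whereas on the log scale the two proportional increases are identical.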

11.5 General discussion

The results of previous studies in the literature provide evidence for the exist-
ence of both input-​independent discrimination and input-​guided learning of
phonetic categories. The natural capacity to discriminate certain phonetic
categories is universally available at birth; the discrimination is maintained
if the input language continues to support those contrasts, but the discrim-
ination gradually declines if the contrast is absent in the input. This pattern
was shown for various consonants and vowels. In the limited studies on lex-
ical tones, early tonal discrimination and later decline were observed in non-​
tone-​learning infants (Mattock and Burnham 2006; Mattock et al. 2008; Shi,
Santos, Gao and Li 2017; Yeung et al. 2013). Our present study demonstrated
the continuing ability to categorize native tone contrasts in Mandarin-​
learning infants from four to thirteen months of age, consistent with the idea
that infants are born as universal listeners, and that input experience serves to
maintain the perceptual sensitivity to native tonal contrasts.
Besides the evidence of maintenance in phonetic development, previous
research on consonants has shown that experience with the ambient language
can exert an enhancement effect for certain contrasts (e.g., Kuhl et al. 2006;
Tsao, Liu, and Kuhl 2006; Narayan, Werker, and Beddor 2010). With respect
to lexical tones, it is unclear whether there is input-​driven facilitation of tonal
discrimination over age. Our present study tested tonal contrasts that are pre-
sumably less salient perceptually (including the most similar T2-​T3 contrast
in Mandarin), offering a potential opportunity for observing gradual learning
from input exposure. However, we found that Mandarin-​learning infants
perceived the tonal contrasts at both younger and older ages during the first
year of life, showing no evidence of improvement.
The effect of perceptual salience for early phonetic development has been
much discussed in the field. Acoustically, more distinct contrasts are assumed
to be easier for discrimination, for both native and non-​native contrasts; con-
versely, acoustically similar contrasts should be more difficult. Supporting
evidence was reported in experiments that tested infants’ discrimination of
different tonal contrasts (e.g., Tsao 2008). T2 and T3 in Mandarin are
generally considered the most similar contrast in terms of pitch patterns;
nevertheless, the creaky mode of phonation in T3 might be a helpful cue. T1-T4
may arguably be more distinct in their pitch patterns, and their durations
clearly differ. Even non-tone-learning infants and toddlers can discriminate
this contrast (Liu and Kager 2014; Shi, Santos, Gao and Li 2017). On the other
hand, the T2-T3 and T1-T4 contrasts seemed to be equally confusable for
adult non-tone listeners in the study of So and Best (2010), who explained
this result on the basis of their comparable degree of similarity in phonetic
features (i.e., pitch-based features such as High and Low). Thus, if the pitch
features of T2-T3 are considered as LH-LL and those of T1-T4 as HH-HL, the
tones within each contrast share the onset feature. However, the citation form
of T3 has a final rise, making it LLH in pitch contour, which is why T2 and
T3 are generally regarded as similar in the field. In our experiment, the
perception of these two contrasts did not differ for Mandarin-learning infants.
They discriminated both contrasts equally well, suggesting that acoustic cues
beyond pitch patterns (such as creaky phonation) may contribute importantly
to the comparable perceptual salience of the two contrasts. Future studies
should examine whether these two contrasts differ from the most distinct con-
trast T1-​T3 in infants’ perceptual development.
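
The feature-based comparison discussed above can be made concrete with a small sketch. The feature strings are the ones given in the text (T1 = HH, T2 = LH, T3 = LL or, in citation form, LLH, T4 = HL); the dictionary and function names are hypothetical, and comparing only the first and last feature is a simplifying assumption.

```python
# Pitch-feature representations as given in the discussion.
TONE_FEATURES = {"T1": "HH", "T2": "LH", "T3": "LL", "T4": "HL"}
# Variant in which T3 carries its citation-form final rise.
CITATION_FEATURES = dict(TONE_FEATURES, T3="LLH")

def shared_positions(tone_a, tone_b, features=TONE_FEATURES):
    """Report whether two tones share their onset and offset pitch feature."""
    a, b = features[tone_a], features[tone_b]
    return {"onset": a[0] == b[0], "offset": a[-1] == b[-1]}

print("T2-T3:", shared_positions("T2", "T3"))
print("T1-T4:", shared_positions("T1", "T4"))
print("T2-T3 (citation T3):", shared_positions("T2", "T3", CITATION_FEATURES))
```

Under the two-feature representation both contrasts share only the onset, while the citation form of T3 also shares its offset with T2.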
The comparable discrimination for the T2-​T3 and T1-​T4 contrasts in
our study cannot be explained in terms of contour tones versus level tones
described in phonological theory (Yip 2002). The T2-​T3 contrast contains
two contour tones, with T3 being a complex contour tone. The T1-​T4 con-
trast, on the other hand, contains one level tone and one falling contour, which
should be easier for discrimination than the T2-​T3 contrast. Our results are
not consistent with this prediction. Contour tones are not necessarily more
difficult for perception than level tones. Relative perceptual salience for tonal
contrasts appears to depend on the exact acoustic-​phonetic differences of the
contrasting tones.
Our study further demonstrates that infants can categorize lexical tones.
In the test phase of our experiment both Same and Different trials presented
novel stimuli, unlike previous studies, which presented the same exemplars
throughout the experiment for the same-​tone category but new exemplars for
the contrasting tone. Hence, our infants’ responses to the test stimuli could
not be simply due to a stronger interest in new versus old stimuli. Rather, they
perceived the Same-​trial new stimuli as belonging to the same tonal category
of the habituation exemplars, and their stronger interest in the Different-​
trial stimuli suggests that they perceived them as belonging to a contrasting
tonal category. In this sense, their tonal perception showed a certain degree
of abstractness.
In conclusion, based on the findings from studies in perceptual develop-
ment of phonetic categories, especially those from infant studies, we now know
more about infants’ initial state of speech-​processing capacities and the role
of input for later phonetic development. Input-independent processing and
input-guided learning are both involved during acquisition. Furthermore, the
perceptual system functions similarly for segmental categories (consonants
and vowels) and for suprasegmental categories such as lexical tones,
suggesting that they belong to a common phonetic-phonological system, which
is subject to the same underlying mechanisms of acquisition and processing.

Author notes
The experiment reported in this chapter formed part of the doctoral thesis
of the first author. The data were presented at the 2010 Speech Prosody and
2011 BUCLD meetings. This research was supported by grants from the
National Social Science Fund of China (Project No.: 08AYY02) and from the
Natural Sciences and Engineering Research Council of Canada (NSERC).
Corresponding authors for this article: Rushen Shi and Jun Gao.

References

Best, C. T. (1995) “A direct realist view of cross-​language speech perception” in
Strange, W. (ed.) Speech perception and linguistic experience: Issues in cross-​language
research. Timonium, MD: York Press, pp. 171–​204.
Best, C. T., McRoberts, G. W., and Sithole, N. M. (1988) “Examination of perceptual
reorganization for nonnative speech contrasts: Zulu click discrimination by
English-​speaking adults and infants”, Journal of Experimental Psychology: Human
Perception and Performance, 14(3), pp. 345–​360.
Harrison, P. (2000) “Acquiring the phonology of lexical tone in infancy”, Lingua,
110(8), pp. 581–​616.
Jusczyk, P. W., Cutler, A., and Redanz, N. (1993) “Preference for the predominant
stress patterns of English words”, Child Development, 64, pp. 675–​687.
Kuhl, P. K. (1991) “Human adults and human infants show a ‘perceptual magnet
effect’ for the prototypes of speech categories, monkeys do not”, Perception and
Psychophysics, 50, pp. 93–​107.
Kuhl, P. K., Stevens, E., Hayashi, A., Deguchi, T., Kiritani, S., and Iverson, P. (2006)
“Infants show a facilitation effect for native language phonetic perception between
6 and 12 months”, Developmental Science, 9(2), pp. F13–​F21.
Kuhl, P. K., Williams, K. A., Lacerda, F., Stevens, K. N., and Lindblom, B. (1992)
“Linguistic experience alters phonetic perception in infants by 6 months of age”,
Science, 255, pp. 606–​608.
Liu, L., and Kager, R. (2014) “Perception of tones by infants learning a non-tone
language”, Cognition, 133(2), pp. 385–394.
Mattock, K., and Burnham, D. (2006) “Chinese and English infants’ tone percep-
tion: Evidence for perceptual reorganization”, Infancy, 10(3), pp. 241–​265.
Mattock, K., Molnar, M., Polka, L., and Burnham, D. (2008) “The developmental
course of lexical tone perception in the first year of life”, Cognition, 106(3), pp.
Mattys, L., and Jusczyk, P. W. (2001) “Phonotactic cues for segmentation of fluent
speech by infants”, Cognition, 78, pp. 91–​121.

Narayan, C., Werker, J. F., and Beddor, P. (2010) “The interaction between acoustic
salience and language experience in developmental speech perception: Evidence
from nasal place discrimination”, Developmental Science, 13(3), pp. 407–​420.
Polka, L., and Bohn, O.-​S. (1996) “A cross-​language comparison of vowel perception
in English-learning and German-learning infants”, Journal of the Acoustical
Society of America, 100, pp. 577–592.
Polka, L., Colantonio, C., and Sundara, M. (2001) “A cross-​language comparison
of /d/–/ð/ perception: Evidence for a new developmental pattern”, Journal of the
Acoustical Society of America, 109(5), pp. 2190–​2201.
Polka, L., and Werker, J. F. (1994) “Developmental changes in perception of non-​
native vowel contrasts”, Journal of Experimental Psychology: Human Perception
and Performance, 20(2), pp. 421–​435.
Shi, R. (2009) “Contextual variability and infants’ perception of tonal categories”,
Chinese Journal of Phonetics, 2, pp. 1–​9.
Shi, R., Gao, J., Achim, A., and Li, A. (2017) “Perception of lexical tones in native
Mandarin-learning preverbal infants and toddlers”, Frontiers in Psychology, 8,
pp. 11–17.
Shi, R., Santos, E., Gao, J., and Li, A. (2017) “Perception of similar and dissimilar
lexical tones by non-tone-learning infants”, Infancy, 22(6), pp. 790–800.
Singh, L., Hui, T. J., Chan, C., and Golinkoff, R. M. (2014) “Influences of vowel
and tone variation on emergent word knowledge: A cross-​linguistic investigation”,
Developmental Science, 17(1), pp. 94–​109.
So, C., and Best, C. (2010) “Cross-language perception of non-native tonal
contrasts: Effects of native phonological and phonetic influences”, Language and
Speech, 53(2), pp. 273–​293.
Tsao, F. M. (2008) “The effect of acoustical similarity on lexical-​tone perception of
one-​year-​old Mandarin-​learning infants”, Chinese Journal of Psychology, 50(2),
pp. 111–​124.
Tsao, F. M., Liu, H. M., and Kuhl, P. K. (2006) “Perception of native and non-​native
affricate-fricative contrasts: Cross-language tests on adults and infants”, Journal of
the Acoustical Society of America, 120(4), pp. 2285–2294.
Werker, J. F., Gilbert, J. H. V., Humphrey, K., and Tees, R. C. (1981) “Developmental
aspects of cross-language speech perception”, Child Development, 52(1), pp.
Werker, J. F., and Tees, R. C. (1984) “Cross-​language speech perception: Evidence
for perceptual reorganization during the first year of life”, Infant Behavior and
Development, 7(1), pp. 49–​63.
Yeung, H. H., Chen, K. H., and Werker, J. F. (2013) “When does native language
input affect phonetic perception? The precocious case of lexical tone”, Journal of
Memory and Language, 68(2), pp. 123–​139.
Yip, M. (2002) Tone. Cambridge: Cambridge University Press.

F0 development in Cantonese
pre-​adolescent children
Wai-​Sum Lee

12.1 Introduction
There have been a number of frequency studies of F0 (pitch) development
in children. Kent (1976), in a survey of acoustic studies of children’s speech,
shows that there is an overall drop in F0 throughout the developmental course
from infancy to adulthood. For both males and females, F0 is at its highest of
about 400–​500 Hz during the first year, and it decreases sharply to about 300
Hz over the first three years, which is followed by a gradual drop in F0 to about
250 Hz when reaching the onset of puberty at age 11 or 12. Between the two
genders, a significant difference in F0 emerges after age 11, and the difference
becomes more apparent after age 13, due to a further drop in F0 to about 100–​
150 Hz for males during the period from 13 to 17 years of age. From infancy
to adulthood, males undergo an overall decrease in F0 of approximately two
octaves, but for females it is just over an octave. Kent points out that the gen-
eral pattern of the developmental course requires careful consideration, as the
amount of the data on F0 development from different age groups is limited.
Also, the F0 data presented in past studies are not always comparable to each
other, due to differences in test material, analysis method, and number of
subjects.
Eguchi and Hirsh (1969) and Lee, Potamianos, and Narayanan (1999) are
two large-​scale studies of the developmental change of speech in children
of a wide range of ages. Both analyze the acoustic properties of American
English vowels. In Eguchi and Hirsh (1969), the vowels are [i æ u ɛ a ɔ] from
84 subjects, including children aged from 3 to 13 and adults of both genders,
and in Lee, Potamianos, and Narayanan (1999), the vowels are [i ɪ ɛ æ ɑ ɔ ʌ ʊ
u ɝ] from 436 children and adolescents aged from 5 to 18 and 56 adults, with
males and females in each age group. Results of the two studies show that for
both genders, the F0 value decreases gradually with increasing age throughout
the pre-​adolescent period before age 11. In Eguchi and Hirsh (1969), a sub-
stantial drop in F0 is observed in male children from 11 to 13, which is taken
to indicate the onset of the adolescent voice change. In contrast, the F0 drop
is gradual and small in female children from 11 to 13. Furthermore, the F0 in
male children at age 13 (221.1 Hz) is about an octave higher than that in male
adults (124.2 Hz), but the difference in F0 is small between female children
of age 13 (239.8 Hz) and female adults (220.9 Hz). Lee, Potamianos, and
Narayanan (1999) report that a substantial drop in F0 in male children occurs
between age 12 (226 Hz) and age 15 (127 Hz). The latter approximates the F0
value for male adults (134 Hz). The F0 values for female children of age 12
(231 Hz) and female adults (227 Hz) are similar. Both studies show that the
adolescent voice change ends earlier in female children than male children.
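
The octave comparisons in the preceding paragraphs follow from the base-2 logarithm of a frequency ratio, since one octave corresponds to a doubling of F0. A minimal sketch, using only the F0 values quoted above; the function name octaves is hypothetical.

```python
import math

def octaves(f_high_hz, f_low_hz):
    """Interval between two F0 values in octaves (1 octave = factor of 2)."""
    return math.log2(f_high_hz / f_low_hz)

# F0 values (Hz) as quoted from Eguchi and Hirsh (1969).
male_13_vs_adult = octaves(221.1, 124.2)
female_13_vs_adult = octaves(239.8, 220.9)
# F0 values (Hz) as quoted from Lee, Potamianos, and Narayanan (1999).
male_12_vs_15 = octaves(226, 127)

print(f"Males, age 13 vs adults:   {male_13_vs_adult:.2f} octaves")
print(f"Females, age 13 vs adults: {female_13_vs_adult:.2f} octaves")
print(f"Males, age 12 vs age 15:   {male_12_vs_15:.2f} octaves")
```

The same function can be applied to the approximate ranges summarized from Kent (1976) to compare the overall decline in F0 for males and females.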
Similar findings to those in Eguchi and Hirsh (1969) and Lee, Potamianos,
and Narayanan (1999) have been reported in other cross-​sectional studies
of the F0 development in children (Curry 1940; Fairbanks, Herbert, and
Hammond 1949; Fairbanks, Wiley, and Lassman 1949; Fairbanks 1950;
Weinberg and Bennett 1971; Bennett and Weinberg 1979a, 1979b; Hasek,
Singh, and Murry 1980; Sorenson 1989; Busby and Plant 1995; Whiteside
and Hodgson 1998, 1999, 2000; Perry, Ohde, and Ashmead 2001; Baker et al.
2008; Lee and Iverson 2008, 2009). In these studies, speaker variations in the
developmental pattern of pitch are observed, as speech development is related
to the physical growth, which varies substantially across individual speakers.
There has been longitudinal research on F0 development in children’s speech,
investigating the progress of the adolescent voice change in relation to the
laryngeal growth in individual children. Bennett (1983) is a three-​year longi-
tudinal study of F0 change in American English-​speaking children, 15 males
and 10 females, from 7 to 11 years of age. Hollien, Green, and Massey (1994),
also a longitudinal study, examines the development of the
adolescent voice change in 48 American English-​speaking male children aged
10 and 11 over a period of five years. The findings in both studies are generally
similar to those in the cross-​sectional studies.
Based on the findings reported in the above-mentioned studies, some
generalizations of the patterns of pitch development in pre-​adolescent chil-
dren may be made. (i) There is an overall decline in F0, as age increases
throughout the pre-​adolescent period. (ii) For both male and female pre-​
adolescent children, the age-​related F0 drop is gradual and progressive. (iii)
The gender-​related difference in F0 is not apparent until the onset of the ado-
lescent voice change in male children, and the adolescent voice change is more
obvious and takes a longer time to complete in males than females. (iv) The
controversy among the previous studies lies mainly in the exact ages at which
the adolescent voice change in children begins and ends.
Nearly all of the past developmental studies of F0 change were based on
speech data from English-​speaking subjects. In English, a non-​tone language,
lexical meaning is not determined by variation in pitch. It is of interest and
significance to examine the developmental pitch pattern in a tone language,
such as Cantonese Chinese. The present study investigates the F0 (pitch)
development in Cantonese-​speaking pre-​adolescent children by analyzing the
F0 of the Cantonese tones in the speech of male and female children of the
age groups ranging from 4 to 12. The F0 data obtained are further analyzed
for the age-​and gender-​related patterns of developmental change in F0 of the
lexical tones in Cantonese. Children’s F0 data are compared to those for male
and female adults. A comparison of the F0 development data of Cantonese
and those of English as reported in the previous studies will also be made to
explore the language factor in F0 development.

12.2 Method

12.2.1 Subjects
In this study, speech data were collected from a total of 100 native Cantonese
speakers, comprising 90 pre-adolescent kindergarten or primary school
children and 10 young adult university students. The children formed nine consecutive
age groups from 4 to 12 years, and in each age group there were five males and
five females. The ten adults, five males and five females, were in their early 20s.
The means and standard deviations of the ages of five children of the same
gender in each of the nine age groups are presented in Table 12.1. For male or
female children, the age difference between any two consecutive age groups is
one year plus/​minus two months, thus ranging from 10 to 14 months. Between
male and female children of the same age group, the age difference is one
or two months. All the speakers were born in Hong Kong and grew up in
a monolingual Cantonese-​speaking family. They do not have a history of
speech and hearing problems. This study passed the ethical screening process
of the Research Committee at the City University of Hong Kong and received
prior parental consent for each child participant.

Table 12.1 Means (n = 5) and standard deviations (SD) of the ages of male and
female children of nine age groups from 4 to 12 years

Age   Male children                       Female children
      Mean (year; month)   SD (months)    Mean (year; month)   SD (months)
4     4; 6                 1.87           4; 5                 2.74
5     5; 8                 2.49           5; 6                 1.52
6     6; 6                 1.41           6; 7                 0.89
7     7; 6                 2.49           7; 5                 2.07
8     8; 8                 2.17           8; 7                 0.89
9     9; 5                 2.59           9; 7                 1.10
10    10; 5                0.89           10; 5                0.71
11    11; 4                0.89           11; 5                1.00
12    12; 4                1.34           12; 4                2.28

12.2.2 Test material

The test material used for speech sample elicitation was a set of monosyllabic
words in Cantonese. As stated in Zee (1999), Cantonese monosyllabic words are
associated with one of nine citation tones, consisting of six long tones
[55 33 22 21 25 23] on (C)V (C = syllable-initial consonant; V = vowel or
diphthong) or (C)VN (N = syllable-final nasal) syllables and three short
tones [5 3 2] on (C)VS (S = syllable-final stop) syllables. In this study, the
test monosyllabic words associated with the long tones are [pa55] ‘father’,
[pʰa33] ‘fear’, [ha22] ‘below’, [pʰa21] ‘to climb’, [ta25] ‘to hit’, and
[ma23] ‘horse’, and those associated with the short tones are [kʰak5] ‘card’,
[pat3] ‘eight’, and [pak2] ‘white’. All the test words contain the same vowel
[a] to rule out any possible intrinsic pitch effect of the vowels (Whalen and
Levitt 1995). The test words are commonly used in everyday conversations in
the Cantonese-speaking speech community of Hong Kong, and they are familiar to
the children in the study. In this chapter, only the results of the F0
analysis of the three Cantonese long level tones [55 33 22] are presented.

12.2.3 Data collection and analysis

A randomized list of the test words in Chinese characters was prepared for
the elicitation of speech data. For younger children, pictures illustrating the
meanings of the test words were placed alongside the Chinese characters.
Digital audio recordings of the test words were made in a quiet or
sound-treated room at local schools. Three repetitions of the reading list were
recorded for each speaker, using a Tascam HD-P2, a portable high-resolution
(up to 192 kHz at 24-bit) digital recorder. The subjects were instructed to
utter the test words one by one at a normal rate of speech and at a consistent
loudness.
The digitized speech data were down-sampled to 12,000 Hz for F0 analysis of
the tones on the test words, using the pitch-synchronized F0 tracing program
available in the KayPENTAX CSL (Computerized Speech Lab) 4500 speech analysis
software. The F0 values (in Hz) were obtained at the time point 50 percent
into the duration of each of the 15 tokens (3 repetitions × 5 speakers of the
same age and gender group) of each Cantonese level tone [55], [33], or [22].
The F0 values of the 15 tokens of each tone were subsequently averaged to
determine the difference in mean F0 of each of the three Cantonese tones
across the children of the nine different age groups and the two gender
groups and also between children and adults.
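
The measurement procedure described above (one F0 value at the temporal
midpoint of each token, then a mean over the 15 tokens of a group) can be
sketched as follows. This is only an illustration under stated assumptions: it
does not reproduce the CSL software, the function names are hypothetical, and
the F0 track in the usage lines is invented for demonstration rather than
taken from the study's recordings.

```python
from statistics import mean


def f0_at_midpoint(track, start, end):
    """Return the F0 sample closest to 50 percent of a token's duration.

    track: list of (time_in_seconds, f0_in_hz) pairs (hypothetical format).
    start, end: token boundaries in seconds.
    """
    midpoint = start + 0.5 * (end - start)
    # Pick the sample whose time stamp is nearest to the midpoint.
    _, nearest_f0 = min(track, key=lambda sample: abs(sample[0] - midpoint))
    return nearest_f0


def group_mean_f0(midpoint_values, expected_tokens=15):
    """Average the midpoint F0 values of one tone for one age/gender group.

    15 tokens = 3 repetitions x 5 speakers, as in the chapter.
    """
    if len(midpoint_values) != expected_tokens:
        raise ValueError(
            f"expected {expected_tokens} tokens, got {len(midpoint_values)}"
        )
    return mean(midpoint_values)


# Invented F0 track for a single token lasting from 0.0 s to 0.2 s.
track = [(0.00, 250.0), (0.05, 252.0), (0.10, 255.0), (0.15, 253.0), (0.20, 249.0)]
print(f0_at_midpoint(track, 0.0, 0.2))
```

Taking a single midpoint value is a reasonable simplification for level tones,
whose F0 is roughly flat; it would not capture the shape of a contour tone.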

12.3 Results
This section presents the mean or average F0 value for each of the Cantonese
level tones [55 33 22] uttered by the children of the nine age groups and adults,
male and female. The mean F0 values (see Table 12.2) are compared (i) among
children of the nine age groups from 4 to 12 years, (ii) between children of
each age group and adults of the same gender, and (iii) between the two
genders within each age group and across the age groups. Statistical analyses,
ANOVA and t-​test, were performed to determine the significance level of the
between-​group differences in F0.

12.3.1 Developmental F0 change in pre-adolescent children at the ages of 4–12 years
Figure 12.1, which is based on the said mean F0 values, shows the develop-
mental F0 change in the tones [55 33 22] for male (upper panel) and female
(lower panel) children of the nine age groups from 4 to 12 years and adults
in early 20s. As can be seen in the two graphs, for both genders, there is an
overall downward trend in F0 for all the three tones, as the age increases. The
overall F0 change in each tone is significant across the nine age groups of chil-
dren, both male (p < .001) and female (p < .001).
For children from age four to age ten, similar patterns of the F0 change in
the three tones as a function of age are observed for the two genders. Some
generalizations can be made. First, the highest F0 values are at four and five
years of age. Second, between the ages of five and six, there is a significant
(p < .01) drop in F0, which is taken to indicate the transition from early
childhood to middle childhood. Third, from age six to age ten, the drop in F0
is slight and gradual, with no significant F0 difference between any two
consecutive ages, male or female.
For children of the two older groups of 11 and 12, the difference in F0
between the two age groups is larger in males than females. For male children,
there is a significant (p < .001) drop in F0 for each of the three tones, which
is taken to indicate the onset of the adolescent voice change. For female
children, the F0 change is not significant for any of the three tones, which
means the adolescent voice change in female children at ages 11 and 12 is only
slight compared to male children at similar ages.

Figure 12.1 Developmental (mean) F0 change in the Cantonese tones [55 33 22] for
male (upper panel) and female (lower panel) children at 4 to 12 years of
age and adults in early 20s (x-axis: age group, 4 to 12 and 20s; y-axis: F0
in Hz)

12.3.2 F0 difference between children and adults

A comparison of the mean F0 values (in Hz) for children with those for adults
of the same gender, as presented in Table 12.2, shows that for every tone there
is a large difference in F0 between male children of the nine age groups (4 to
12), whose values range from 196 Hz to 287 Hz for [55], 171 Hz to 254 Hz for
[33], and 163 Hz to 246 Hz for [22], and male adults (134 Hz for [55], 113 Hz
for [33], and 101 Hz for [22]), and that the difference for each of the three
tones is significant (p < .001). The data suggest that the adolescent voice
change in male children goes beyond the age of 12. The difference in F0 value
for any tone is also significant (p < .05) between female children at 4 to 10
(ranging from 269 Hz to 317 Hz for [55], 232 Hz to 278 Hz for [33], and 217 Hz
to 265 Hz for [22]) and female adults (257 Hz for [55], 215 Hz for [33], and
191 Hz for [22]). However, for older female children at 11 and 12, the F0
values of the three tones (ranging from 253 Hz to 262 Hz for [55], 220 Hz to
226 Hz for [33], and 204 Hz to 211 Hz for [22]) are similar to those for
female adults, and there are no significant differences in F0 between female
children and female adults for the tones [55] and [33]. This shows that the
adolescent voice change continues after age 11 for male children but not
female children.
The F0 difference between children and adults of the same gender is also
expressed in ratios of the mean F0 values for children of the nine age groups to
those for adults of the same gender. As presented in Table 12.3, the F0 ratios
of children of any age group to adults are markedly larger for males than for
females. For the three tones, the F0 ratios of male children at 4 to 12 years
of age to male adults range from 1.5 to 2.4. Since one octave corresponds to a
doubling of frequency, the F0 is thus about 0.5 to 1.3 octaves higher for male
children than for male adults. The F0 ratios of female children to female
adults fall in a lower range, 1.0 to 1.4, for the three tones. For any tone,
the F0 ratios of female children at 9–12 years of age to female adults are
about 1.0 to 1.1, indicating that the F0 difference between female children of
that age range and female adults is minimal.
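
The relation between an F0 ratio and an interval in octaves can be checked
numerically. One octave is a doubling of frequency, so a ratio r corresponds
to log2(r) octaves. The sketch below applies this to a few mean F0 values
copied from Table 12.2 (male children at ages 5 and 12, and male adults); the
variable and function names are illustrative only and are not part of the
original analysis.

```python
import math

# Mean F0 values in Hz, copied from Table 12.2 (male speakers only).
MALE_ADULT_F0 = {"55": 134.49, "33": 112.76, "22": 101.49}
MALE_CHILD_F0 = {
    5: {"55": 286.93, "33": 254.02, "22": 246.36},
    12: {"55": 195.79, "33": 170.86, "22": 162.92},
}


def ratio_to_octaves(ratio):
    """Convert a frequency ratio to an interval in octaves (log base 2)."""
    return math.log2(ratio)


for age, tones in MALE_CHILD_F0.items():
    for tone, child_f0 in tones.items():
        ratio = child_f0 / MALE_ADULT_F0[tone]
        octaves = ratio_to_octaves(ratio)
        print(f"age {age}, tone [{tone}]: ratio {ratio:.3f}, {octaves:.2f} octaves")
```

Because the octave scale is logarithmic, a ratio of 1.5 is already more than
half an octave, while a ratio of 2.4 is well short of one and a half octaves.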

Table 12.2 The mean F0 values (in Hz) of the Cantonese tones [55 33 22] for male
and female children at 4 to 12 years of age and adults in early 20s

Age    Male F0                       Female F0
       [55]      [33]      [22]      [55]      [33]      [22]
4      285.73    246.08    236.27    317.02    278.10    265.13
5      286.93    254.02    246.36    310.60    274.21    256.61
6      262.05    235.29    218.16    276.72    248.89    232.64
7      260.39    229.95    217.40    271.38    247.10    233.96
8      260.44    229.26    215.18    278.16    239.88    224.48
9      252.56    222.88    209.55    277.16    233.89    217.42
10     259.08    236.08    221.58    268.65    231.75    218.87
11     249.06    221.10    208.99    261.85    225.67    211.07
12     195.79    170.86    162.92    252.60    220.24    204.24
20s    134.49    112.76    101.49    256.99    215.48    191.41

Table 12.3 Ratios of the mean F0 values of the Cantonese tones [55 33 22]
for children at 4 to 12 years of age to those for adults in early 20s of the
same gender

Age     F0 ratios of male children     F0 ratios of female children
group   to male adults                 to female adults
        [55]     [33]     [22]         [55]     [33]     [22]
4       2.124    2.182    2.328        1.234    1.291    1.385
5       2.133    2.253    2.427        1.209    1.273    1.341
6       1.948    2.087    2.149        1.077    1.155    1.215
7       1.936    2.030    2.142        1.056    1.147    1.222
8       1.936    2.033    2.120        1.082    1.113    1.173
9       1.878    1.977    2.065        1.079    1.085    1.136
10      1.926    2.094    2.183        1.045    1.076    1.143
11      1.852    1.961    2.059        1.019    1.047    1.103
12      1.456    1.515    1.605        0.983    1.022    1.067

12.3.3 F0 difference between males and females

Table 12.4 presents the F0 ratios of female children to male children of each
age group and the F0 ratios of female adults to male adults for the tones [55
33 22]. As presented in the table, the F0 ratio of females to males is larger than
1.0 in nearly all the cases, indicating that overall the F0 is higher for females
than males in adults and children. For adults, the gender difference in F0 is
large and significant (p < .001) for each of the three tones: the F0 for female
adults is about 1.9 times that for male adults, or nearly one octave higher. As
for children below age 12, the F0 ratios of females to males are in the range
of 1.0 to 1.1 for the three tones, indicating only a slight difference in F0
between the two genders. At age 12, the F0 ratios of females to males increase
to a range of 1.2 to 1.3 for the three tones, and the F0 differences between
the two genders are significant for the tones [55] (p < .001), [33] (p < .001),
and [22] (p < .01). The F0 data suggest that the gender difference in
children’s voices emerges at the age of 12 years.

Table 12.4 F0 ratios of females to males of each of the age groups, 4–12 and early
20s, for the Cantonese tones [55 33 22]

Age    F0 ratios of females to males
       [55]     [33]     [22]
4      1.110    1.130    1.122
5      1.082    1.079    1.042
6      1.056    1.058    1.066
7      1.042    1.079    1.076
8      1.068    1.046    1.043
9      1.097    1.049    1.038
10     1.037    0.982    0.988
11     1.051    1.021    1.010
12     1.290    1.289    1.254
20s    1.911    1.911    1.886

12.3.4 F0 differences among the three Cantonese level tones [55 33 22]
Similar patterns of the differences in F0 of the three Cantonese level tones [55
33 22] are observed between the children of any age group and adults, male
or female. For all the speakers, as expected, the F0 is highest for [55], followed
by [33] and [22] in decreasing order, and there is a larger F0 difference or tonal
space between [55] and [33] than between [33] and [22]. Table 12.5 presents the
F0 ratios of [55] to [33] and [33] to [22] for male and female children at the ages
of 4–12 years and for male and female adults. For children of the nine age
groups, the F0 ratios of [55] to [33] range from 1.10 to 1.16 for males and 1.10
to 1.19 for females, that is, the F0 is 10 percent to 16 percent (males) or 10
percent to 19 percent (females) higher for [55] than [33]; the F0 ratios of [33]
to [22] range from 1.03 to 1.08 for males and 1.05 to 1.08 for females, that is,
the F0 is 3 percent to 8 percent (males) or 5 percent to 8 percent (females)
higher for [33] than [22]. Thus, the F0 difference is about twice as large
between [55] and [33] as between [33] and [22], which corresponds to the tonal
space differences suggested by the tone letters [55 33 22]. A similar pattern
of differences in F0 of the three level tones is also observed for the adults,
though the F0 ratios of [55] to [33], that is, 1.19 for both male and female
adults, and the F0 ratios of [33] to [22], that is, 1.11 for male adults and
1.13 for female adults, are slightly higher than the F0 ratios for children.
The data indicate that children at four years of age have acquired the
adult-like pattern of tonal space in Cantonese. Thus, the tonal spaces among
the three Cantonese level tones [55 33 22] are maintained, irrespective of the
difference in absolute F0 values of the tones.

Table 12.5 F0 ratios of the Cantonese tones [55] to [33] and [33] to [22] for male and
female children at 4 to 12 years of age and adults in early 20s

Age    Male F0 ratios                 Female F0 ratios
       [55] to [33]   [33] to [22]    [55] to [33]   [33] to [22]
4      1.161          1.041           1.140          1.049
5      1.130          1.031           1.133          1.069
6      1.114          1.079           1.112          1.070
7      1.137          1.053           1.098          1.056
8      1.136          1.065           1.160          1.069
9      1.133          1.064           1.185          1.076
10     1.097          1.065           1.159          1.059
11     1.126          1.058           1.160          1.069
12     1.146          1.049           1.147          1.078
20s    1.193          1.111           1.193          1.126
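
The tonal-space comparison above rests on two ratios per speaker group, [55]
to [33] and [33] to [22]. A minimal sketch of that arithmetic, using mean F0
values copied from Table 12.2 for four-year-old males and male adults, is
given below. The conversion to semitones is an added assumption (it is not
used in the chapter) and is included only because it expresses the same
interval on a logarithmic scale; all names are illustrative.

```python
import math

# Mean F0 values in Hz, copied from Table 12.2.
MEAN_F0 = {
    "male, age 4": {"55": 285.73, "33": 246.08, "22": 236.27},
    "male, adult": {"55": 134.49, "33": 112.76, "22": 101.49},
}


def tonal_space(f0_by_tone):
    """Return the F0 ratios [55]/[33] and [33]/[22] for one speaker group."""
    upper = f0_by_tone["55"] / f0_by_tone["33"]
    lower = f0_by_tone["33"] / f0_by_tone["22"]
    return upper, lower


def semitones(ratio):
    """Express a frequency ratio in semitones (12 per octave)."""
    return 12 * math.log2(ratio)


for group, f0 in MEAN_F0.items():
    upper, lower = tonal_space(f0)
    print(
        f"{group}: [55]/[33] = {upper:.3f} ({semitones(upper):.2f} st), "
        f"[33]/[22] = {lower:.3f} ({semitones(lower):.2f} st)"
    )
```

Although the absolute F0 of the two groups differs by roughly a factor of two,
both show a wider interval between [55] and [33] than between [33] and [22],
which is the sense in which the tonal spaces are maintained.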

12.4 Discussion
The general patterns of the age-​related and gender-​related developmental F0
change in Cantonese-​speaking children presented above are similar to those
in English-​speaking children reported in the previous cross-​sectional studies
of the F0 development, such as Eguchi and Hirsh (1969) and Lee, Potamianos,
and Narayanan (1999). To demonstrate the cross-​language similarities and
differences in F0 development, the F0 data on the English vowel [ɑ] from pre-​
adolescent children at 5–​12 years of age and young adults of age 18 presented
in Lee, Potamianos, and Narayanan (1999) are compared with the F0 data
on the comparable Cantonese vowel [a]‌in the present study. As the youngest
English children in Lee, Potamianos, and Narayanan (1999) are five years
old, the F0 data from Cantonese children at age four are not included in the
comparison.
Figure 12.2 shows the superimposed developmental F0 curves for male
(upper panel) and female (lower panel) speakers of Cantonese and English.
The curves are plotted based on the mean F0 values for the English [ɑ] and
those for the three Cantonese tones [55 33 22] on the vowel [a]‌from children
of the same age and gender groups of the two languages. As shown in the
figure, the F0 values for English children are similar to the F0 values of the
high-​level tone [55] for Cantonese children. This is true for male and female
children of the eight age groups, except the age group of the five-​year-​olds,
where the F0 value for English children is much closer to the F0 value of the
tone [33] than the tone [55] in Cantonese.

Figure 12.2 Developmental change in the (mean) F0 values (in Hz) of the English [ɑ]
(Lee, Potamianos, and Narayanan 1999) and the Cantonese [a] associated
with each one of the three tones [55 33 22] for male (upper panel) and
female (lower panel) children at 5 to 12 years of age and adults aged 18
or in early 20s (x-axis: age group, 5 to 12 and 18/20s; y-axis: F0 in Hz)

A comparison of the F0 data for English children with the F0 data on
the tone [55] for Cantonese children of the eight age groups from 5 to 12
and adults shows that for both languages, the F0 of male or female children
tends to decrease progressively as the age increases, with the smallest F0 value
at age 12. Overall, the decrease in F0 as a function of age throughout the
pre-adolescent years from 5 to 12 appears to be more gradual for English
children than Cantonese children, male and female. A noticeable difference
between the two languages is the marked F0 drop between age five and age six
in Cantonese but not English. The F0 drop in Cantonese children is taken to
indicate the onset of voice change from early childhood to middle childhood.
Such a voice change is assumed to take place before age five for English
children. Another difference between Cantonese and English is the abrupt F0
drop from age 11 to age 12 for Cantonese male children, which is taken as an
indication of the onset of adolescent voice change. The drop in F0 between
age 11 and age 12 for English male children is smaller, suggesting that the
adolescent voice change in English children has yet to start. There is no
similar F0 drop for female children of either language from 11 to 12,
suggesting that the voice change seen in male children does not occur in
female children at ages 11 and 12.

Table 12.6 F0 ratios of children to adults of the same gender for Cantonese and
English

Age    Cantonese children to adults    English children to adults
       F0 ratio                        F0 ratio
       Male      Female                Male      Female
5      2.133     1.209                 2.145     1.124
6      1.948     1.077                 2.202     1.095
7      1.936     1.056                 2.137     1.161
8      1.936     1.082                 1.992     1.128
9      1.878     1.079                 2.056     1.103
10     1.926     1.045                 2.073     1.087
11     1.852     1.019                 2.016     1.021
12     1.456     0.983                 1.879     0.967

Source of English data: Lee, Potamianos, and Narayanan (1999).

The F0 values for children and those for adults of the same gender are
compared for Cantonese and English. Table 12.6 presents the F0 ratios of
children of each age group to adults of the same gender for both languages.
The Cantonese data are based on the mean F0 values for the tone [55] on the
vowel [a] and the English data on the mean F0 values for the vowel [ɑ] reported
in Lee, Potamianos, and Narayanan (1999). As presented in the table, for both
Cantonese and English, the difference in F0 between male adults and male
children of any age group is large, with the F0 ratio of children to adults close
to 2.0 in all but a single case. Thus, the F0 for male children is approximately
twice that for male adults, that is, about an octave higher. A large drop in
the F0 ratio of male children at age 12 to male adults is observed for
Cantonese (down to 1.456) but not English (down to 1.879). The difference is
assumed to be due to the disparity in the onset time for the adolescent voice
change in male children between the two languages. For both languages, the F0
difference is small between female children and female adults, with the ratios
of children to adults ranging from 1.0 to 1.2. At ages 11 and 12, for both
Cantonese and English, the F0 ratio of female children to female adults is
close to 1.0, indicating a similarity in voice between female children and
female adults. Thus, for both languages the developmental F0 change is smaller
for female children and ends earlier in female children than male children.

Table 12.7 F0 ratios of females to males of the same age group for Cantonese and
English

Age       Female to male F0 ratio
          Cantonese    English
5         1.082        1.023
6         1.056        0.971
7         1.042        1.060
8         1.068        1.105
9         1.097        1.047
10        1.037        1.023
11        1.051        0.988
12        1.290        1.004
18/20s    1.911        1.952

Source of English data: Lee, Potamianos, and Narayanan (1999).

The gender-related F0 changes between the two languages are also
compared, based on the mean F0 values for the Cantonese tone [55] on the
vowel [a] in the present study and the F0 values of the English vowel [ɑ] from
Lee, Potamianos, and Narayanan (1999). Table 12.7 presents the F0 ratios of
females to males of each age group for the two languages. As presented in the
table, the F0 ratios of female children to male children are about 1.0–1.1
in most cases for Cantonese and English. This indicates that the gender
difference in F0 in pre-adolescent children of the two languages is minimal,
though the F0 value tends to be slightly larger for female than male children
of the same age group. A noticeable difference is the marked increase in the
female-to-male ratio at age 12 for Cantonese (1.290) but not English (1.004).
The data indicate that the emergence of a gender distinction in F0 or pitch
occurs earlier in Cantonese children than English children. Between male and
female adults, as expected, there is a large gender difference in F0 for both
Cantonese, with a female-to-male F0 ratio of 1.911, and English, with a
female-to-male F0 ratio of 1.952. This indicates that the F0 for female adults
is close to twice the F0 for male adults, whether the speakers are of a tone
or non-tone language.

12.5 Conclusion
This chapter has presented the age-​related and gender-​related developmental
patterns of the F0 values of the Cantonese level tones [55 33 22] in pre-​
adolescent children, male and female. The significant findings of this study
are summarized as follows. First, the large F0 drop for all three Cantonese
tones between ages five and six for children of both genders suggests the onset
of physical maturation, including laryngeal growth and the lengthening of the
vocal folds. Second, the large drop in F0 in Cantonese male children between
ages 11 and 12 marks the onset of the adolescent voice change. Third, the
relative differences in F0 value between the tones [55] and [33] and between the
tones [33] and [22] are maintained in children, male and female, at different
ages and in male and female adults, showing that for the tones to be linguis-
tically distinguishable, the tonal spaces are maintained despite the changed
absolute F0 values. Fourth, the similarity in the F0 developmental pattern
between Cantonese and English pre-adolescent children (i) shows that the
difference in the prosodic system, with Cantonese being a tone language and
English a stress language, does not appear to have an effect on the F0
developmental change, and (ii) may suggest a possible universal of F0
development in children.

References
Baker, S., Weinrich, B., Bevington, M., Schroth, K., and Schroeder, E. (2008) “The
effect of task type on fundamental frequency in children”, International Journal of
Pediatric Otorhinolaryngology, 72(6), pp. 885–889.
Bennett, S. (1983) “A 3-​year longitudinal study of school-​aged children’s fundamental
frequencies”, Journal of Speech and Hearing Research, 26(1), pp. 137–​141.
Bennett, S., and Weinberg, B. (1979a) “Sexual characteristics of preadolescent
children’s voices”, Journal of the Acoustical Society of America, 65(1), pp. 179–​189.
Bennett, S., and Weinberg, B. (1979b) “Acoustic correlates of perceived sexual identity
in preadolescent children’s voices”, Journal of the Acoustical Society of America,
66(4), pp. 989–1000.
Busby, P. A., and Plant, G. L. (1995) “Formant frequency values of vowels produced
by preadolescent boys and girls”, Journal of the Acoustical Society of America,
97(4), pp. 2603–2606.
Curry, E. T. (1940) “The pitch characteristics of the adolescent male voice”, Speech
Monographs, 7(1), pp. 48–​62.
Eguchi, S., and Hirsh, I. J. (1969) “Development of speech sounds in children”, Acta
Oto-​laryngologica, Supplementum, 257, pp. 1–​51.
Fairbanks, G. (1950) “An acoustical comparison of vocal pitch in seven- and
eight-year-old children”, Child Development, 21(2), pp. 121–129.
Fairbanks, G., Herbert, E. L., and Hammond, J. M. (1949) “An acoustical study of
vocal pitch in seven- and eight-year-old girls”, Child Development, 20(2), pp. 71–78.
Fairbanks, G., Wiley, J. H., and Lassman, F. M. (1949) “An acoustical study of vocal
pitch in seven- and eight-year-old boys”, Child Development, 20(2), pp. 63–69.
Hasek, C. S., Singh, S., and Murry, T. (1980) “Acoustic attributes of preadolescent
voices”, Journal of the Acoustical Society of America, 68(5), pp. 1262–1265.
Hollien, H., Green, R., and Massey, K. (1994) “Longitudinal research on adolescent
voice change in males”, Journal of the Acoustical Society of America, 96(5), pp.
Kent, R. D. (1976) “Anatomical and neuromuscular maturation of the speech
mechanism: Evidence from acoustic studies”, Journal of Speech and Hearing
Research, 19(3), pp. 421–447.
Lee, S. B., Potamianos, A., and Narayanan, S. (1999) “Acoustics of children’s
speech: Developmental changes of temporal and spectral parameters”, Journal of
the Acoustical Society of America, 105(3), pp. 1455–1468.
Lee, S. Y., and Iverson, G. K. (2008) “The development of monophthongal vowels
in Korean: Age and sex differences”, Clinical Linguistics and Phonetics, 22(7), pp.
Lee, S. Y., and Iverson, G. K. (2009) “Vowel development in English and Korean:
Similarities and differences in linguistic and non-linguistic factors”, Speech
Communication, 51(8), pp. 684–694.
Perry, T. L., Ohde, R. N., and Ashmead, D. H. (2001) “The acoustic bases for gender
identification from children’s voices”, Journal of the Acoustical Society of America,
109(6), pp. 2988–​2998.
Sorenson, D. N. (1989) “A fundamental frequency investigation of children ages 6–​
10 years old”, Journal of Communication Disorders, 22(2), pp. 115–​123.
Weinberg, B., and Bennett, S. (1971) “Speaker sex recognition of 5- and
6-year-old children’s voices”, Journal of the Acoustical Society of America, 50(4), pp.
Whalen, D. H., and Levitt, A. G. (1995) “The universality of intrinsic F0 of vowels”,
Journal of Phonetics, 23(3), pp. 349–366.
Whiteside, S. P., and Hodgson, C. (1998) “The development of fundamental frequency
in 6-​to 10-​year old children: A brief study”, Journal of the International Phonetic
Association, 28(1/​2), pp. 55–​62.
Whiteside, S. P., and Hodgson, C. (1999) “Acoustic characteristics in 6-​10-​year-​old
children’s voices: Some preliminary findings”, Logopedics Phoniatrics Vocology,
24(1), pp. 6–​13.
Whiteside, S. P., and Hodgson, C. (2000) “Some acoustic characteristics in the voices
of 6-​to 10-​year-​old children and adults: A comparative sex and developmental per-
spective”, Logopedics Phoniatrics Vocology, 25(3), pp. 122–​132.
Zee, E. (1999) “Chinese (Hong Kong Cantonese) –​IPA Illustration” in International
Phonetic Association (ed.) Handbook of the International Phonetic Association.
Cambridge: Cambridge University Press, pp. 58–​60.

13 The positional effects of contour tones in second language Chinese¹
Hang Zhang

13.1 Introduction
Research on second language (L2) acquisition in the past few decades has
shown that, although one’s first language (L1) has an effect on second
language sound patterns (i.e., “L1 transfer”), interlanguage grammars are also
restricted by some universal phonetic or phonological principles (Broselow
et al. 1998; Major 2001; Eckman 2004, among others). Studies on the L2
acquisition of Modern Standard Chinese (hereafter, Chinese) tones have
uncovered some patterns that appear to be derived neither from learners’ native
language nor from the target language. Rather, such error patterns often
reveal universally preferred structures (H. Zhang 2010, 2016). However, it is
still unclear exactly how universals can be accessible to adult L2 learners and
the extent to which specific phonetic mechanisms or phonological principles
shape the interlanguage grammar. The current study on L2 Chinese contour
tones attempts to 1) show how the position within a word affects production
accuracy, 2) demonstrate a possible correlation between L2 tones and a
cross-linguistically common phonetic mechanism of tonal coarticulation, and
3) inspire further research on this topic.
Contour tones are regarded as more marked (complex) than level tones
(Ohala 1978; J. Zhang 2002). Contour tones also show a higher degree of
vulnerability than level tones in L2 studies (H. Zhang 2010, 2016), which
motivates the current study’s further exploration of contour tones in L2
Chinese, in particular their positional effects. H. Zhang (2015) investigates
how the position of a tone within a clause affects the production of L2
tones by examining the performance of all four Chinese tones at initial and
final positions of various prosodic units within sentences (referred to here as
“sentential” positional effects). However, research on the intertonal effects
of L2 tones, that is, how tones affect one another in connected speech and
thereby decrease the intelligibility of non-native tones, remains
underdeveloped. Of particular interest in this study are anticipatory
coarticulation and carry-over coarticulation. Anticipatory coarticulation
involves one speech sound being influenced by subsequent sounds, and carry-over
coarticulation involves one speech sound being influenced by preceding sounds.
Chang and Yao (2016) compare tone productions in context as produced
by heritage learners, L2 learners, and native speakers of Chinese, but focus
on the effect on duration rather than on other key acoustic properties that
contribute to the intelligibility of non-native tones. Bent (2005) and
B. Yang (2015) both investigate the production of L2 tones when embedded in
trisyllabic words.
However, in Bent (2005), there is only one fixed trisyllabic model, thereby
limiting the study’s ability to determine the effects of various contexts on tone
production. B. Yang (2015) looks at both heritage learners’ and non-​heritage
learners’ perception and production of Chinese tones in the middle of tone
strings. The test tone (T) is embedded in four contexts: preceding a tone with
either a high (H) or a low (L) starting point, and following a tone with either a
H or L endpoint (i.e., LTH, HTH, HTL, and LTL). Yang (2015) finds that L2
learners have more difficulty producing tones in contour-​dependent contexts
(particularly HTL) than in level contexts (particularly LTL and HTH).
However, the design of the trisyllable model, with the test tone simultaneously
influenced by the preceding and following context tones, cannot separate the
contribution of anticipatory effects from that of carry-​over effects.
The current study surveys the general positional effects and intertonal
effects of contour tones, T2 and T4, in disyllabic words rather than trisyllables.
For the general positional effects, this study investigates how the word-​initial
position and word-​final position affect the accuracy rates of T2 and T4. For
the intertonal effects, this study attempts to determine if and how the two types
of coarticulation, especially the anticipatory coarticulation, affect the intelligi-
bility of the L2 contour tones. This study will examine the tonal productions
of 60 learners of Chinese with different L1 backgrounds. Each of the three
L1s, namely, American English, Tokyo Japanese, and Seoul Korean (hence-
forth English, Japanese, and Korean), represents one of three different types
of non-​tonal languages according to the characteristics of their word prosody
(Jun 2005): stress-​accent languages (English), lexical-​pitch-​accent languages
(Japanese), and non-​stress and non-​lexical-​pitch-​accent languages (Korean).
Coincidentally, speakers of English, Japanese, and Korean make up the
majority of the population of L2 learners of Chinese worldwide (Hu 2008).
I will first review studies on relevant tone characteristics in both native
and non-native Chinese in Section 13.2. The design of the experiment will be
explained in Section 13.3. Section 13.4 first reports the general positional
effects in L2 learners’ production of T2 and T4 in disyllabic words and then
examines any intertonal effects upon T2 and T4, in particular anticipatory
coarticulation. Section 13.5 offers a discussion, followed by some concluding
remarks in Section 13.6.

13.2 Background and predictions

This section briefly introduces the prosodic structures of the L1s (English,
Korean, and Japanese), the L2 (Chinese), and the ‘universal’ Tonal Markedness
Scale, in order to make hypotheses and predictions regarding the difficulty
hierarchy and positional effects in L2 contour tones.

13.2.1 Pitch in Japanese, Korean, and English

Pitch serves a contrastive function in Chinese but not in non-​tonal languages.
Essentially, pitch contour is specified mainly at the phrasal or sentence level in
non-​tonal languages by means of a complex interplay between metrical struc-
ture, prosodic phrasing, syntax, and pragmatics. At the lexical level, the lexical
accent in English regularly appears as stress with corollary spectral effects such
as the adjustment of pitch shapes. That is, pitch contours differ depending upon
the stress patterns present, either trochaic or iambic in the English lexicon.
According to Delattre (1964), about 74 percent of English words are trochaic.
In Japanese, a “lexical pitch accent” language, pitch features operate
at both the lexical and phrasal/​sentential levels. It shares features of tonal
languages in that pitch is to some degree lexically associated. Japanese words
can be accented or unaccented, with accented words making up about half
of the vocabulary. The placement of the accent in some cases is lexically con-
trastive. For example, the sequence hashi spoken in isolation can be accented
in two ways: either háshi (accented on the first syllable, meaning ‘chopsticks’)
or hashí (flat or accented on the second syllable, meaning either ‘edge’ or
‘bridge’). Importantly, however, Japanese has only a relatively small number
of these minimal pairs differing in accent placement, which is very different
from tonal languages. Beyond this, Japanese pitch patterns are mainly speci-
fied at the phrasal and sentence levels. For further information on Japanese
prosody, see Kubozono (2011) and Venditti (2005).
At the word level, Korean is considered to be a “non-​stress and non-​pitch-​
accent” language, since fundamental frequency peaks and valleys of Korean
intonation are not linked to any specific syllable of a word but rather to a
word’s location in a phrase. According to Jun’s (1996) model, the typical
pattern of the Korean Accentual Phrase (AP) (which is analogous to phono-
logical phrases in English) is Low-​High-​Low-​High or High-​High-​Low-​High,
where the AP-​initial tone is determined by the laryngeal feature of the phrase
initial segment. For further information, see Jun (1996, 2005).
Following Pike (1948), Goldsmith (1976), and Pierrehumbert (1980), the
phonetic pitch contours of English intonation are treated as a series of level
pitches, such as high (H) and low (L). This study assumes the same for the other
two non-tonal languages involved in this study, Japanese and Korean.2 It is
also assumed that English, Japanese, and Korean speakers have no precedent
for Chinese lexical contour tones, in which H and L pitch targets behave
as a single unit (as a whole tone) dominated by a single Tone Bearing Unit
(TBU) and contrast word meanings (Bao 1999). Based on this assumption,
the Chinese contour tones, T2 and T4, are new linguistic structures for L2
learners, and both are expected to be difficult for them.

13.2.2 Chinese lexical tones

Chinese is a typical tone language: A change of tone can cause a change of
word meaning, making tone acquisition crucial for adult Chinese learners.

348 Hang Zhang

Following Yip (2002), this study assumes that the default form of Tone 3 is
a low-​level tone phonologically.3 For the sake of transparency, the four tones
of Chinese are abbreviated as H (T1), LH (T2), L (T3), and HL (T4) in this
chapter. Disyllabic words are believed to be the most frequent type of word
in Chinese, and all possible combinations of the four basic lexical tones occur
in Chinese words, with the exception of T3-​T3 (L-​L), due to the well-​known
Tone 3 sandhi process. This means that T2 and T4 are freely distributed in
Chinese. As mentioned above, since there is no precedent of lexical tones for
the L2 learners in their native languages, rising tones (T2) and falling tones
(T4) would both be difficult for speakers of non-tonal languages learning
Chinese.
As for the positional effects, according to the research investigating the
realization of contour tones (J. Zhang 2004) and the “right prominence”
characteristics of Chinese languages (Yip 1980; Chen 2000), contour tones
at prosodic unit final positions may be more resistant to change than at
initial positions. Therefore, if L2 Chinese learners learn the positional prop-
erties of the Chinese tone system, the contour tones at word or phrase final
position may be produced at a greater rate of accuracy than those at word-​
initial positions in L2 Chinese. Studies investigating L2 tones produced by
English speakers partially confirm this hypothesis, finding T4 to be produced
accurately more often at word-​final positions than at initial positions of
disyllabic words (Broselow, Hurtig, and Ringen 1987; H. Zhang 2010). It
is not clear whether this pattern holds true for the other contour tone in
Chinese, T2. The current study will survey the general positional patterns
of both T2 and T4.
In Chinese, the phonetic pitch contours of lexical tones are also influenced
by neighboring tones, or tone coarticulation. Coarticulation refers to the phe-
nomenon whereby a given speech sound is altered in its phonetic manifest-
ation due to influences from adjacent sounds. Coarticulation is not only found
in segments but also in lexical tones of Chinese (Shen 1990; Xu 1994, 1997,
and others) and other Asian tonal languages, such as Thai (Gandour
et al. 1992; Gandour et al. 1994; Potisuk et al. 1997) and Vietnamese (Han
and Kim 1974; Brunelle 2003). Coarticulation has often been assumed to
be an automatic consequence of speech physiology (Chomsky and Halle
1968). Both anticipatory coarticulation and carry-​over coarticulation effects
occur in tonal languages, although most studies show that carry-​over coar-
ticulation effects are greater than anticipatory coarticulation effects. Further
studies on tone coarticulation have found that, while carry-​over effects are
mostly assimilatory, anticipatory effects tend to be of a dissimilatory nature
(Gandour et al. 1994; Xu 1994, 1997; Potisuk et al. 1997). When carry-​over
assimilation occurs, the F0 of a tone will assimilate to the offset value of the
previous tone. However, when anticipatory dissimilation occurs, the F0 of a
tone will diverge from the onset of the following tone. That is, a low onset
value of a tone raises the maximum F0 value of a preceding tone (“Pre-​low
Raising”), and a high onset value of a tone lowers the maximum F0 value of

a preceding tone (Gandour et al. 1994; Xu 1997). Furthermore, this dissimi-
latory effect is most clearly seen when the first syllable has a tonal value of T2
or T4 in Chinese (Xu 1997).
Tone coarticulation effects, as discussed in the literature on native tonal
languages, exist only at the phonetic level. In native Chinese, pitch contours
of the affected tones are only slightly revised, while canonical pitch contours
stay intact so that native listeners can still perceive them as belonging to the
same tone category. However, this study hypothesizes that coarticulation may
not only affect the phonetics of L2 contour tones but also make the phonological
identities of the affected tones unintelligible. In other words, the phonetic
variations of contour tones will be greater in L2 tones than in native
Chinese tones, to the point of making their tone identities unrecognizable to
native Chinese listeners, who would thus classify them as tone errors.
To account for carry-over coarticulation, I focus on the test tones (that is,
T2 and T4) located in disyllabic word-final positions. Following the carry-over
mechanisms discussed in Gandour et al. (1994) and Xu (1997), the onset
of T2 and T4 should be lower when preceded by tones with low offsets (i.e.,
T3 and T4) than when preceded by tones with high offsets (i.e., T1 and T2). The
accuracy rates of T2 preceded by T3 or T4 (both with L offsets) are thus
expected to be higher than those of T2 preceded by T1 or T2 (both with H offsets).
Similarly, the accuracy rates of T4 following T1 and T2 (with H offsets) would
be higher than those of T4 following T3 and T4 (with L offsets).
Xu (1997) makes two important observations about anticipatory coarticulation
effects in native Chinese tones. First, the anticipatory effect is greatest in the
earlier portions of the F0 contour for T4, and greatest in the final portions for
T2. The maximum F0 of word-​initial T2 or T4 is raised. This suggests that the
H component pitch target of T2 (LH) and T4 (HL) is the component most
sensitive to anticipatory effects. Second, the H component tone of a word-​
initial T2 or T4 is highest when the final tone is T3, the tone with the lowest
onset F0. The second-​strongest effect comes from a final T2. Therefore, the
maximum F0 values of affected contour tones followed by T3 and T2 (both
of which have low onsets) are higher than those contours followed by T4 and
T1 (both of which have high onsets).
To account for the influence of possible anticipatory coarticulation on L2
tone phonology, I focus on the test tones T2 and T4 at word-​initial positions
followed by various tones. It is expected that, if the anticipatory coarticu-
lation mechanism significantly influences the F0 contour of contour tones
in L2 Chinese, the maximum F0 of T2 and T4 (i.e., the offset of the rising
tone T2 and the onset of the falling tone T4) would be lower when followed
by tones with high onsets (i.e., T1 and T4), than when followed by tones with
low onsets (i.e., T2 and T3). As a result, the accuracy rates of T2 followed
by T2 and T3 (both with L onsets) would be greater than those of T2 when
followed by T1 and T4 (both with H onsets). Likewise, the accuracy rates of
T4 followed by T2 and T3 (both with L onsets) would be higher than those
of T4 followed by T1 and T4 (both with H onsets), as shown in Table 13.1.

Table 13.1 Potential influence of anticipatory coarticulation on T2 and T4
accuracy rates

     Initial syllable                  Final syllable       Predicted accuracy rate changes
                                                            of T2 and T4 at initial syllables
(a)  T2 (LH, ending-H raised)          T2 (LH) or T3 (L)    Increased
(b)  T2 (LH, ending-H lowered)         T1 (H) or T4 (HL)    Decreased
(c)  T4 (HL, beginning-H raised)       T2 (LH) or T3 (L)    Increased
(d)  T4 (HL, beginning-H lowered)      T1 (H) or T4 (HL)    Decreased
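
The predictions in Table 13.1 follow from a single rule: a low onset on the final syllable is expected to raise the H target of the initial contour tone, and a high onset is expected to lower it. The sketch below restates that rule only for illustration. The names `TONE_ONSET` and `predicted_change` are hypothetical and are not part of the study's materials.

```python
# Onset pitch target of each Chinese tone, using the chapter's notation:
# H (T1), LH (T2), L (T3), HL (T4). Only the first target matters here.
TONE_ONSET = {"T1": "H", "T2": "L", "T3": "L", "T4": "H"}


def predicted_change(initial_tone: str, final_tone: str) -> str:
    """Return the predicted accuracy change for a word-initial contour tone.

    Illustrative helper (an assumption, not the author's code): a following
    low onset predicts "Increased", a following high onset "Decreased".
    """
    if initial_tone not in ("T2", "T4"):
        raise ValueError("predictions are stated only for contour tones T2 and T4")
    return "Increased" if TONE_ONSET[final_tone] == "L" else "Decreased"


# Enumerate the eight sequences covered by rows (a)-(d) of Table 13.1.
predictions = {
    (initial, final): predicted_change(initial, final)
    for initial in ("T2", "T4")
    for final in ("T1", "T2", "T3", "T4")
}

for (initial, final), change in predictions.items():
    print(f"{initial}-{final}: {change}")
```

The rule depends only on the onset of the final syllable, which is why rows (a) and (c), and rows (b) and (d), share the same final-syllable conditions.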

A few L2 studies investigating disyllabic words have found that learners of
tonal languages exhibit possible anticipatory effects. Wang (1995) observed
the performance of six English-​speaking learners producing four L2 Chinese
tones across 80 disyllabic words. In most cases of T2 errors, the target T2s
were mistakenly produced as low tones when followed by T1 or T4. Wang
(1997) observed a similar pattern among both English and Japanese speaking
learners of Chinese. The rising tones (T2 in the case of Chinese) preceding
tones with a high onset (T1 and T4) are produced poorly by L2 learners.
However, Wang (1995, 1997) only focused on the error patterns of T2 and
did not consider the effect on the other contour tone, T4. The present study
thus will examine the intertonal effects on both T2 and T4. In particular,
I will examine 1) the error or accuracy rates to determine if the coarticulation
mechanisms affect the intelligibility of L2 contour tones; 2) the F0 values of
T2 offsets and T4 onsets to check whether coarticulation also affects the L2
contour tones phonetically; and 3) the error types of the affected T2 and T4 to
check whether the learners misproduce the contour tones as low tones due
to coarticulation effects.

13.2.3 The Tonal Markedness Constraint

Studies on L2 phonology have shown that an L2 learner’s productions can be
influenced by the learner’s native language and the target language grammars
(Major 2001, 2008, and others). In addition, L2 sound patterns can also be
influenced by universal processes such as markedness (see Eckman 2008 for
a review of related literature). Markedness concerns universal preferences in
languages for certain forms or features and has been found to play a role in L2
phonology (Eckman 1991, 2008).
The Tonal Markedness Scale (TMS), defined as *Rise >> *Fall >>
*Level (Ohala 1978; Hyman and VanBik 2004), states that rising tones are
generally more marked than falling tones and that falling tones are more
marked than level tones. This phonetically grounded Tonal Markedness Scale
is not only reflected in the tone inventories of natural languages (J. Zhang
2002) but also in the dynamic process of language acquisition. It is reported
that Chinese-​speaking children produce level tones and falling tones earlier
than rising tones (Li and Thompson 1977; Zhu and Dodd 2000). As with

infants acquiring Chinese as a first language, adults acquiring Chinese as a
second language also show effects of the Tonal Markedness Scale. Studies that
have focused on the order of acquisition of Chinese tone types for L2 learners
have reported various findings (see reviews in Sun 1998 and H. Zhang 2013).
However, the majority of these studies have found that L2 learners acquire
the high-​level tone first (i.e., with the fewest errors), and the high falling tone
earlier than the rising and/​or dipping tones.
While T2 and T4 are predicted to be equally difficult based on the prosodic
structures of L1s and L2, the Tonal Markedness Constraint predicts that L2
learners of Chinese may produce T4 at a greater rate of accuracy than T2 in
L2 Chinese.

13.3 Methods
In order to survey the performance of L2 contour tones in disyllabic words,
a phonological study consisting of a pre-test and a main experiment was
designed. The pre-test, a reading task of 48 monosyllabic morphemes, was
used to ensure that all participants were able to produce individual lexical
tones correctly.

13.3.1 Participants
Sixty-​seven learners participated in the pre-​test. Seven were excluded from
participation in the main experiment due to low accuracy rates in the pre-​
test (below 85 percent). Among the 60 participants in the main experiment,
20 were native English speakers (12 males and 8 females), 20 were native
Japanese speakers (10 males and 10 females) from areas with Tokyo-​type
pitch accent, and 20 were native Korean speakers from Seoul (8 males and
12 females). All participants had been learning Chinese for at least 6 months,
but no more than 18 months at the time of data collection, placing them at
approximately an intermediate level. The learners were recruited from a US
university on the East Coast and a university in China. All participants
claimed that Chinese was the only tonal language they had studied. Learners
participated in this study voluntarily, and each was paid for the recording

13.3.2 Procedure
For the main experiment, participants were given a list of sentences and were
asked to produce each in Chinese at a normal speed. Stimuli for the main
experiment were disyllabic words bearing all 16 possible combinations of the
four lexical tones. Although this study pays particular attention to T2 and T4,
it also includes the remaining lexical tones in order to obtain a broader per-
spective of tone performance, as well as for comparative purposes. Each of
the 16 possible tone combinations was presented in equal proportions. Two
words (consisting of different morphemes) for each tone combination type

were used, resulting in 32 distinct words. All words were at the lowest profi-
ciency level according to the General Outline of the Chinese Vocabulary Levels
and Graded Chinese Characters (HSK Department 1992). The morphemes
were selected based on the following criteria:

1 The syllables in the test words cannot be neutralized or undergo tone
sandhi rules in spoken Mandarin (with the exception of T3-T3 sequences).
2 Stimuli are limited to nouns or verbs so that the test words can serve as an
attributive modifying a following noun in a sentence. No function words
are used.
3 The use of obstruents in test words is kept to a minimum, while the use of
sonorants is maximized so that pitch tracking can be continuous.
4 All syllables are CV(V) in structure without a coda.

The test words were embedded in sentences, where they served as modifiers
of nouns. In order to avoid anticipatory
and carry-​over effects from neighboring tones (Xu 1997), the tokens were
embedded in sentences where the preceding and following morphemes were
both the neutral-​toned particle de. This way, any effect from the neutral tone
would be the same for all test tokens. In addition, these test words were placed
in a sentence-​medial position to reduce the possible interference of sentence
intonation. Example (5) displays the carrier sentence structure:

(5) Test sentence:

Chinese: 我 觉得 X X 的 东西 很 好.
Pinyin:  Wǒ juéde X X de dōngxi hěn hǎo.
Gloss:   I think X X particle things very good
‘I think XX things are very good.’

The 32 sentences were repeated twice, resulting in 64 sentences in total for
each trial. Sentences were randomly ordered in a reading list. Participants
were provided with Pinyin transcriptions and Chinese characters, as well as
the English, Japanese, or Korean translations of each sentence in the reading
lists. The recording was paused between pages of the reading list. Participants
were recorded with Version 5.2.17 of Praat (Boersma and Weenink 2011) in
a soundproof recording booth. Altogether, participants produced 7,680 test
tokens (32 words × 2 syllables × 2 repetitions × 60 participants).
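
The token counts reported above can be checked with simple arithmetic. The sketch below reproduces the figures given in the text. The per-tone, per-position count at the end is not stated in the chapter; it is derived here on the assumption that every tone combination contributes equally, as the design describes. All variable names are illustrative.

```python
N_TONES = 4                # T1-T4
WORDS_PER_COMBINATION = 2  # two different words per tone combination
SYLLABLES_PER_WORD = 2     # disyllabic stimuli
REPETITIONS = 2            # each sentence read twice
N_PARTICIPANTS = 60        # 20 English, 20 Japanese, 20 Korean speakers

tone_combinations = N_TONES ** 2                              # 16
distinct_words = tone_combinations * WORDS_PER_COMBINATION    # 32
sentences_per_participant = distinct_words * REPETITIONS      # 64
total_tokens = (
    distinct_words * SYLLABLES_PER_WORD * REPETITIONS * N_PARTICIPANTS
)                                                             # 7,680

# Derived (assumption: balanced design): tokens of one tone in one word
# position. A given tone fills the initial slot in 4 of the 16 combinations.
tokens_per_tone_per_position = (
    N_TONES * WORDS_PER_COMBINATION * REPETITIONS * N_PARTICIPANTS
)

print(tone_combinations, distinct_words, sentences_per_participant)
print(total_tokens, tokens_per_tone_per_position)
```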

13.3.3 Analysis
The correctness of L2 tonal production was judged within sentences.
The author, a native speaker of Chinese, judged whether or not the tonal
productions were acceptable by both listening to productions and measuring
pitches in Praat. Productions were marked as “correct” or “incorrect”. A tone

was considered to be an error if the contour (rise, fall, level) and/or
register (pitch range, either high or low) of the tone was incorrect. For all
incorrect productions, the participants’ actual tonal production was also
written down. The incorrect tones were classified as either those that do occur
in Chinese (“within inventory”), or those that do not (“out of inventory”). It
turns out that most of the errors fell within the Chinese tone inventory.
In order to guarantee the reliability of correctness judgments and the
transcriptions of the incorrect tones, both intra- and interrater agreement
were calculated. All tonal productions were judged and transcribed twice by
the author, with a one-month interval between judgments. The agreement
rate between these two judgments was 95.6 percent. For interrater
reliability, two other native speakers (referred to here as L and C) were hired to
judge and transcribe one-tenth of the data independently. Both L and C were
native speakers of Chinese and had received training in linguistics. A pairwise
comparison indicates that the author and L had the highest level of
agreement (93.8 percent), the author and C the second-highest level of
agreement (92.6 percent), and C and L the lowest level of agreement
(91.6 percent). The SAS statistical package was used for the following statistical
analyses. The significance criterion adopted for declaring a significant
difference is p < 0.05.
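
An agreement rate of the kind reported above can be computed as the share of tokens that receive the same label in two transcription passes. The following is a minimal sketch under that assumption. The chapter does not specify the exact formula used, and the label lists below are invented toy data, not the study's transcriptions.

```python
def percent_agreement(labels_a, labels_b):
    """Percentage of items given identical labels in two transcription passes."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both passes must label the same number of tokens")
    if not labels_a:
        raise ValueError("at least one token is required")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * matches / len(labels_a)


# Toy example (hypothetical data): eight tokens transcribed twice.
first_pass = ["T2", "T4", "T3", "T2", "T1", "T4", "T2", "T3"]
second_pass = ["T2", "T4", "T3", "T3", "T1", "T4", "T2", "T3"]

toy_rate = percent_agreement(first_pass, second_pass)
print(f"{toy_rate:.1f} percent agreement")
```

Simple percent agreement does not correct for chance agreement; a chance-corrected statistic would be an alternative, but nothing in the text indicates that one was used.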

13.4 Results
This section reports the results obtained from the experiment with Section
13.4.1 focusing on general error patterns of T2 and T4 in disyllabic words and
Section 13.4.2 on intertonal effects, with particular attention paid to anticipa-
tory effects. Although individual differences are found in L2 tone productions,
I chose to look at these learners as a group when examining the results, in
order to posit generalizations concerning the positional effects in L2 tones.

13.4.1 General positional effects on T2 and T4 in disyllabic words

The SURVEYFREQ procedure was used to determine the number of clusters
and observations in the experiment. Figure 13.1 shows the error rates of each
tone in the three L1 groups. Note that the relative rankings of error rates for
each tone within each L1 data set are very similar. In general, T2 is produced
with a higher rate of error than T4, and the error rate of T4 is higher than that of T1.
This is consistent with the hypothesis based on the Tonal Markedness Scale.
To survey the general positional effects, the null hypothesis of equal error rates
(no association) between word-initial and word-final positions was tested
using the Rao-Scott chi-square test. The results in Table 13.2 show that in both
the general dataset and in each L1 group, the error rates of T2 in word-​initial
positions are significantly lower than in word-​final positions, and the error
rates of T4 in word-​initial positions are significantly higher than in word-​final
positions. The first column in Table 13.2 contains the tone names, and the
following columns display the error rate information of the overall dataset
and that of the individual language groups, with the percentages of errors
in word-initial positions at left followed by errors in word-final positions.
Significant differences (p < 0.05) between the error rates in different positions
are highlighted in bold and are in shaded cells.

Figure 13.1 General error rates

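
To illustrate the kind of comparison behind Table 13.2, the sketch below runs an ordinary Pearson chi-square test on a 2 × 2 table of position (initial, final) by outcome (error, correct) for T2 in the overall dataset. This is not the Rao-Scott test used in the chapter: the Rao-Scott version additionally corrects for the clustering of tokens within participants, which the plain statistic ignores. The counts are reconstructed from the reported percentages on the assumption of 960 tokens per tone per position (4 combinations × 2 words × 2 repetitions × 60 participants), and the function name is hypothetical.

```python
def pearson_chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (df = 1) for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must contain at least one token")
    return n * (a * d - b * c) ** 2 / denominator


TOKENS_PER_CELL = 960  # assumption: balanced design, see lead-in

# Error rates for T2 in the overall dataset, as reported in Table 13.2.
initial_errors = round(0.3333 * TOKENS_PER_CELL)
final_errors = round(0.799 * TOKENS_PER_CELL)

statistic = pearson_chi_square_2x2(
    initial_errors, TOKENS_PER_CELL - initial_errors,
    final_errors, TOKENS_PER_CELL - final_errors,
)

CRITICAL_VALUE_DF1_ALPHA_05 = 3.841  # chi-square critical value, df = 1
print(statistic > CRITICAL_VALUE_DF1_ALPHA_05)
```

Because the uncorrected statistic treats all tokens as independent, it will generally overstate the evidence relative to a design-based test, so it should be read only as a rough plausibility check.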
The most striking positional effects are found in the productions of T2
and T4. The word-​initial T2 has a much lower error rate than word-​final
T2, and the error rate of word-​final T4 is significantly lower than word-​
initial T4. This finding holds true across all three groups of speakers, and is
confirmed when examining the types of substitute tones participants used
when they made errors. In the study of L2 acquisition, substitution is not an
arbitrary process but instead stems from a process of avoidance and choice
by L2 learners. In the current dataset, it is found that 1) T4 is used more
often than T2 as a substitute tone for other tones when errors occurred;
and 2) T4 is substituted significantly more often for other tones at word-​
final positions, while T2 is usually substituted for other tones at word-​initial
positions. Table 13.3 lists the substitution patterns with positional infor-
mation. The first line contains the name of the language groups and the
total number of substitute tones in each group. The first column lists the
substitute tones, T2 and T4, employed by L2 learners when the target tones
were incorrectly produced. In each cell, the counts of the substitute tones
followed by their percentages are listed. The percentages under each L1 are
out of all substitute tones in that L1 data set. That is, the percentages are
out of total error numbers in specific L1 sets.
Findings indicate that positional errors are negatively correlated with pos-
itional substitution rates in most of the L2 productions. Where the error rates
for a tone are high, the use of that tone as a substitute for other tones is
low, and vice versa. Across all three groups of learners, the rising tone T2 is
“disfavored” in word-​final positions, and the falling tone T4 is “disfavored”
in word-initial syllables. This pattern is also found in a sentence-level
experiment reported in H. Zhang (2015), who investigates the performance of L2
Chinese tones in varying prosodic units including prosodic words,
phonological phrases, intonation phrases, and clauses.

Table 13.2 Error patterns with positional information

Tone   General                English                Japanese               Korean
       Initial    Final       Initial    Final       Initial    Final       Initial    Final
T1     18.02%     25.42%      24.38%     30%         15.31%     14.38%      14.38%     31.88%
       0.0567                 0.4383                 0.8847                 p = 0.005
T2     33.33%     79.9%       36.88%     78.13%      27.81%     82.19%      35.31%     79.37%
       p < .0001              p < .0001              p < .0001              p < .0001
T3     40.56%     37.08%      52.08%     42.81%      37.08%     29.06%      32.5%      39.38%
       0.4992                 0.3618                 0.3368                 0.3939
T4     44.17%     22.29%      54.69%     28.44%      40%        16.56%      37.81%     21.88%
       p < .0001              p = 0.0022             p < .0001              p = 0.0193

Table 13.3 Substitutions with positional information

Substitute   English/1098 errors      Japanese/821 errors      Korean/931 errors
             initial     final        initial     final        initial     final
T2           102 (9%)    1 (0%)       59 (7%)     4 (0%)       12 (1%)     2 (0%)
T4           51 (5%)     132 (12%)    17 (2%)     71 (9%)      42 (5%)     146 (16%)
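
The percentages in Table 13.3 are taken out of the total number of errors in each L1 group, so they can be recomputed from the counts. The sketch below does this with the values printed in the table. The data structure and names are illustrative only.

```python
# Counts from Table 13.3: group -> (total errors, {substitute: (initial, final)})
SUBSTITUTIONS = {
    "English": (1098, {"T2": (102, 1), "T4": (51, 132)}),
    "Japanese": (821, {"T2": (59, 4), "T4": (17, 71)}),
    "Korean": (931, {"T2": (12, 2), "T4": (42, 146)}),
}

# Percentages as printed in Table 13.3, in the same layout.
REPORTED_PERCENT = {
    "English": {"T2": (9, 0), "T4": (5, 12)},
    "Japanese": {"T2": (7, 0), "T4": (2, 9)},
    "Korean": {"T2": (1, 0), "T4": (5, 16)},
}


def recompute_percentages(substitutions):
    """Express each substitution count as a rounded share of the group's errors."""
    return {
        group: {
            tone: tuple(round(100 * count / total_errors) for count in counts)
            for tone, counts in by_tone.items()
        }
        for group, (total_errors, by_tone) in substitutions.items()
    }


recomputed = recompute_percentages(SUBSTITUTIONS)
for group, by_tone in recomputed.items():
    print(group, by_tone)
print(recomputed == REPORTED_PERCENT)
```

In every group, T4 substitutions are concentrated in word-final position and T2 substitutions in word-initial position, which is the asymmetry described in the text.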

13.4.2 Intertonal effects

The previous section finds that T2 is produced more accurately word-​initially
than word-​finally, and that T4 is produced more accurately word-​finally than
word-initially. However, a closer look at the environments in which errors
were made reveals a few exceptions to this overall pattern. Participants
seem to exhibit two common intertonal effects. First, although in general
word-initial T2 is produced significantly better than word-final T2,
word-initial T2 preceding T1 (and T4) is produced poorly. Second, although
in general word-initial T4 is produced with a greater rate of error than
word-final T4, word-initial T4 is produced remarkably better when it is followed by
T3 than when it is followed by other tones. Although these two findings are
largely overshadowed by the general T2-T4 positional patterns found in the
previous section, they are no less interesting.

Accuracy rates of word-initial T2 and T4

Figure 13.2 displays the accuracy rates of T2 in both word-​initial and word-​
final positions of disyllabic words. In each L1 group, the left four bars
indicate the accuracy rates for word-initial T2s, whereas the right four bars
indicate the accuracy rates for word-final T2s. The left four bars are higher
than the right four, which reflects the better performance of word-initial T2s
in general, as discussed in the previous section. For all three groups of speakers,
the relative rankings of the left four bars are very similar: T2 followed by T3
always has the highest accuracy rate (see arrows), followed by the accuracy
rate of word-​initial T2 preceding another T2, followed by the accuracy rates
of T2 followed by T4 and T1. This finding is consistent with the ranking
of F0 values of T2 offsets when followed by T3, T2, T4, or T1 observed in
Xu’s (1997) study, although his participants were all native Chinese speakers.
Unlike the effect on native Chinese tones in which coarticulation affects only

the phonetic level, anticipatory coarticulation decreases the intelligibility of
L2 contour tones when they are followed by high-onset tones.

Figure 13.2 Accuracy rates of T2 in various tone sequences (T2-T1, T2-T2, T2-T3, T2-T4, T1-T2, T2-T2, T3-T2, and T4-T2) for English, Japanese, and Korean speakers

Figure 13.3 Accuracy rates of T4 in disyllabic words (T4-T1, T4-T2, T4-T3, T4-T4, T1-T4, T2-T4, T3-T4, and T4-T4) for English, Japanese, and Korean speakers

The rankings of the right four bars, that is, the accuracy rates of
word-final T2, differ across the L1 groups. This indicates
that the T2 accuracy patterns fit the predictions about anticipatory
coarticulation effects better than those about carry-over effects.
Figure 13.3 presents the accuracy rates of T4 in both word-initial and
word-final positions of disyllabic words. As in Figure 13.2, in each L1
group, the left four bars indicate the accuracy rates of word-initial T4s. Note
that they are generally lower than the four bars at the right, which represent
the accuracy rates of word-final T4s. The rankings among the left four bars
are more consistent across the L1 groups than those of the right four bars, which
indicates a clearer anticipatory effect than carry-over effect. Across all
groups of speakers, the accuracy rate of word-initial T4 when it is followed by T3 is
remarkably higher than that of other word-initial T4s (see arrows in Figure 13.3).
Two observations can be made from Figures 13.2 and 13.3. First, the similar
accuracy rate patterns of contour tones at word-initial positions across the
three groups of speakers appear to favor the predictions based on
anticipatory effects. That is, the ranking of T2 accuracy rates and the higher
accuracy rate of T4 followed by T3 found across the three groups of speakers
match the expected error patterns based on anticipatory coarticulation.
By contrast, the error patterns of T2 and T4 at word-final
positions, as shown in Figures 13.2 and 13.3, reveal no clear carry-over-related
patterns in the current study. Second, anticipatory coarticulation
is observed in both T2 and T4 sequences, but the influence of anticipatory
coarticulation on T2 is greater than that on T4. In each group, all accuracy
rate comparisons of T2 at various positions suggest anticipatory effects, but
some rankings of accuracy rates of T4 do not support anticipatory effects
(i.e., Korean T4 (T1) versus T4 (T2), T4 (T4) versus T4 (T2), and Japanese T4
(T4) versus T4 (T2)).
Considering the minor carry-over effects observed in the error patterns, the
remainder of this chapter focuses on anticipatory effects only.

F0 values of affected T2 and T4

This section is concerned with whether anticipatory coarticulation is also
reflected in the phonetics of those L2 productions of T2 and T4 that are
judged to be correct. The measurements of phonetic variations of affected T2
and T4 show that the maximum F0 value patterns parallel the findings of the
previous section.
Tables 13.4 and 13.5 display the average F0 of offsets of T2 and onsets of
T4 in the correct L2 tonal productions. Due to the fact that male and female
speakers have very different pitch ranges, the F0 values of male and female
speakers are displayed separately. In T2 productions (in Table 13.4), when T2
on the first syllable of a word is followed by a tone with a low onset (i.e., T2
or T3), the F0 of the offset of T2 is always higher than when it is followed by
a tone with a high onset (i.e., T1 or T4). Both female and male speakers’
maximum F0 values of the correct productions of T2 are fully consistent with the
argument that anticipatory coarticulation plays a role in L2 tones.

Table 13.4 Average F0 values of T2 offsets in correct productions

Target Tones                  T2-(T1)    T2-(T2)    T2-(T3)    T2-(T4)
English speakers    Male      108 Hz     128 Hz     126 Hz     101 Hz
                    Female    185 Hz     224 Hz     221 Hz     170 Hz
Japanese speakers   Male      104 Hz     161 Hz     152 Hz     109 Hz
                    Female    218 Hz     255 Hz     260 Hz     196 Hz
Korean speakers     Male      98 Hz      127 Hz     122 Hz     94 Hz
                    Female    190 Hz     236 Hz     236 Hz     186 Hz

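
The claim about Table 13.4 can be verified row by row: for each speaker group, the lowest T2 offset before a low-onset tone (T2 or T3) should exceed the highest T2 offset before a high-onset tone (T1 or T4). The sketch below applies that check to the values printed in the table. The dictionary layout and function name are illustrative assumptions.

```python
# Average F0 (Hz) of word-initial T2 offsets, keyed by the following tone.
T2_OFFSET_HZ = {
    ("English", "Male"): {"T1": 108, "T2": 128, "T3": 126, "T4": 101},
    ("English", "Female"): {"T1": 185, "T2": 224, "T3": 221, "T4": 170},
    ("Japanese", "Male"): {"T1": 104, "T2": 161, "T3": 152, "T4": 109},
    ("Japanese", "Female"): {"T1": 218, "T2": 255, "T3": 260, "T4": 196},
    ("Korean", "Male"): {"T1": 98, "T2": 127, "T3": 122, "T4": 94},
    ("Korean", "Female"): {"T1": 190, "T2": 236, "T3": 236, "T4": 186},
}


def low_onset_contexts_higher(row):
    """True if F0 before low-onset tones (T2, T3) exceeds F0 before T1 and T4."""
    return min(row["T2"], row["T3"]) > max(row["T1"], row["T4"])


pattern_holds = {
    group: low_onset_contexts_higher(row) for group, row in T2_OFFSET_HZ.items()
}

for (language, sex), holds in pattern_holds.items():
    print(f"{language} {sex}: {holds}")
```

The check uses group averages only, so it says nothing about the variability of individual productions.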
However, when T4 on the first syllable of a word is followed by a tone with
a low onset (i.e., T2 or T3), not all the F0 values of the onsets of T4 are higher
than when it is followed by a tone with a high onset (i.e., T1 and T4).
Table 13.5 shows that 1) the average onsets of T4 when followed by T3
are always higher than when T4 is followed by T1, across all three groups
of speakers; and 2) for the English speakers (both male and female) and
Japanese male speakers, the onset F0 of T4 followed by T2 and T3 is higher
than when T4 is followed by T1 and T4. This pattern echoes the rankings of
accuracy rates of T4 as presented in Figure 13.3, which are less consistent
than those of T2. The inconsistent rankings of T4 onsets may be partially
due to the consonants at the beginning of the T4 syllables. Whereas the
pitch contour is relatively well preserved through the voiced sounds at the end
of T2-bearing syllables (all test syllables ended with vowels), the onsets of
T4 are difficult to detect when the syllable begins with an obstruent
(non-sonorant) consonant. Further studies using stimuli that are made up of
only sonorants are required to accurately test for this effect in the future.

Error types of affected contour tones

This section investigates possible anticipatory effects appearing in the form of
erroneous tone productions. If anticipatory coarticulation affects L2 contour
tones, the H components of T2 and T4 would be lowered when these tones are
followed by tones with high onsets (see cases (b) and (d) in Table 13.1). That is,
when followed by tones with high onsets (i.e., T1 and T4), T2 and T4
would be erroneously produced as low tones (T3) more often than as any
other tone.
Table 13.5 Average F0 values of T4 onsets in correct productions

Target Tones                  T4-(T1)    T4-(T2)    T4-(T3)    T4-(T4)
English speakers    Male      150 Hz     181 Hz     187 Hz     143 Hz
                    Female    242 Hz     264 Hz     264 Hz     251 Hz
Japanese speakers   Male      171 Hz     186 Hz     193 Hz     183 Hz
                    Female    249 Hz     244 Hz     255 Hz     246 Hz
Korean speakers     Male      158 Hz     164 Hz     175 Hz     167 Hz
                    Female    247 Hz     274 Hz     270 Hz     271 Hz

Table 13.6 The top three disyllabic response tones for target T2 (LH) at initial positions

Targets          T2-T1                                       T2-T4
                 English       Japanese      Korean          English       Japanese      Korean
Response tones   T2-T1 (25%)   T2-T1 (53%)   T2-T1 (33%)     T2-T4 (43%)   T2-T4 (61%)   T2-T4 (51%)
                 T3-T1 (19%)   T3-T1 (15%)   T3-T4 (23%)     T3-T4 (18%)   T3-T4 (20%)   T3-T4 (26%)
                 T1-T1 (13%)   T1-T1 (15%)   T3-T1 (20%)     T2-T3 (18%)   T1-T4 (6%)    T3-T1 (8%)

Targets          T4-T1                                       T4-T4
                 English       Japanese      Korean          English       Japanese      Korean
Response tones   T2-T1 (38%)   T4-T1 (46%)   T4-T1 (43%)     T4-T4 (38%)   T4-T4 (53%)   T4-T4 (49%)
                 T1-T1 (18%)   T1-T1 (25%)   T1-T1 (15%)     T3-T4 (19%)   T3-T4 (9%)    T1-T4 (13%)
                 T3-T1 (16%)   T3-T1 (6%)    T4-T4 (14%)     T2-T4 (11%)   T1-T3 (6%)    T4-T3 (9%)

Table 13.7 Statistical analyses of error type comparisons for T2-T1, T2-T4, T4-T1,
and T4-T4

Target   Errors             English group   Japanese group   Korean group   ALL
T2-T1    T3-T1 vs. T1-T1    p = 0.2471      p = 1            p = 0.0197     p = 0.045
T2-T4    T3-T4 vs. T1-T4    p = 0.0029      p = 0.0005       p < 0.0001     p < 0.0001
         T3-T4 vs. T4-T4    p < 0.0001      N/A              p = 0.0001     p < 0.0001
T4-T1    T3-T1 vs. T1-T1    p = 0.5928      p < 0.0001       p = 0.0001     p < 0.0001
         T3-T1 vs. T2-T1    p = 0.0216      p = 1            N/A            p = 0.0213
T4-T4    T3-T4 vs. T1-T4    p = 0.0816      p = 0.0653       p = 0.5045     p = 0.0911
         T3-T4 vs. T2-T4    p = 0.0816      p = 0.0048       p = 0.0004     p < 0.0001

To present a broader view of the substitutions, Table 13.6 lists the three
most frequently produced response tone sequences (including both correct
and incorrect productions) and the percent at which they were produced out
of all response tones for target T2-T1 and T2-T4 sequences (i.e., word-initial
T2 followed by tones with high onsets) and target T4-T1 and T4-T4 (i.e.,
word-​initial T4 followed by tones with high onsets). All the learners produced
erroneous T3-​T1, T3-​T4, T3-​T1, and T3-​T4 sequences more often than other
errors in instances when they were intending to produce T2-​T1, T2-​T4, T4-​
T1, and T4-​T4 sequences. T3 errors are boldfaced when they are widely used
as substitutions for the target T2 and T4 at word-​initial positions.
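
The kind of tally reported in Table 13.6 amounts to counting the response sequences given for one target sequence and expressing each count as a share of all responses. The following is a minimal sketch of that bookkeeping, not the procedure used in the study; the function name `top_responses` and the response list are hypothetical and do not reproduce the study's data.

```python
from collections import Counter

def top_responses(responses, n=3):
    """Return the n most frequent response sequences with their share (%) of all responses."""
    counts = Counter(responses)
    total = sum(counts.values())
    return [(seq, round(100 * k / total)) for seq, k in counts.most_common(n)]

# Hypothetical responses for a single target sequence (T2-T4); not data from the study.
responses = ["T2-T4"] * 10 + ["T3-T4"] * 6 + ["T1-T4"] * 3 + ["T4-T4"] * 1

for seq, pct in top_responses(responses):
    print(f"{seq} ({pct}%)")
```

Because correct productions are counted together with errors, the target sequence itself normally appears among the top responses, as it does in most columns of Table 13.6.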
A likelihood ratio test was used to determine whether T3 is the most fre-
quently produced error for target T2 and T4 at word-​initial positions, when
followed by T1 or T4 (with the word-​final tones correct). For example, if
the target tone sequence is T2-​T4, the number of times T3-​T4 was actu-
ally produced was compared with the number of times T1-​T4 or T4-​T4 was
produced. Table 13.7 displays the results of statistical analyses. Clearly, when
all language groups are pooled together (the ‘All’ column in Table 13.7), T3
is produced for target word-​initial T2 and T4 significantly more often than
other erroneous tones. The data support the hypothesis that L2 contour tones
are constrained by the mechanism of anticipatory coarticulation.
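
The chapter does not spell out the exact form of the likelihood ratio test. One common form for comparing two error counts is the G-test of goodness of fit against the null hypothesis that the two error types are equally likely. The sketch below assumes that form; the function name `g_test_two_counts` and the counts are hypothetical, and the p-value uses the chi-square distribution with one degree of freedom.

```python
import math

def g_test_two_counts(a, b):
    """Likelihood ratio (G) test of H0: two error types are equally likely.

    a, b: observed counts of the two error types being compared.
    Returns (G, p), with p taken from the chi-square distribution with 1 df.
    """
    n = a + b
    if n == 0:
        raise ValueError("need at least one observation")
    expected = n / 2
    # A zero count contributes nothing to G (limit of x * ln x as x -> 0).
    g = 2 * sum(o * math.log(o / expected) for o in (a, b) if o > 0)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Chi-square survival function for 1 df: P(X > g) = erfc(sqrt(g / 2)).
    p = math.erfc(math.sqrt(g / 2))
    return g, p

# Hypothetical counts, e.g. T3-T4 vs. T1-T4 responses for target T2-T4.
g, p = g_test_two_counts(30, 10)
print(f"G = {g:.2f}, p = {p:.4f}")
```

A small p-value under this assumed test indicates only that the two error types are not equally frequent; the direction of the difference is read off the counts themselves.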
All three language groups of L2 learners of Chinese erroneously produced
T3 most frequently when the target tone was a word-​initial T2 or T4 followed
by tones with high onsets. As was the case for the previous two sections, this
was seen most clearly for T2, with results from T4 being less consistent.

13.5 Discussion

13.5.1 General positional effects and the TMS effect in L2 tones

In terms of general positional effects, the data of the present study only
partially support the hypothesis made based on the “right prominence”
characteristics of the Chinese language. In L2 Chinese, T2 is performed much
better at word-​initial positions than at word-​final positions, but T4 has a much
higher accuracy rate at word-​final positions than at word-​initial positions.

The general error rates of T2 are significantly higher than those of T4 across the three
L2 groups, which indicates the effect of the Tonal Markedness Scale. This is
also reflected in the T2-​T4 asymmetry observed in the word-​initial positions
when they are affected by the anticipatory coarticulation. This asymmetry
will be discussed in the next section.
Falling tones are argued to be phonetically easier to perceive and produce
than rising tones according to the Tonal Markedness Scale. Falling and rising
asymmetry has been discussed in phonetic studies and in some survey studies
in the literature of tonal phonology (Ohala 1978; J. Zhang 2002, 2004; Hyman
and VanBik 2004). Falling tones are allowed in many more languages than
rising tones, and even within a language, rising tones are often more restricted in
distribution than falling tones. Therefore, it is not surprising that in this study,
T2 is performed more poorly than T4 in word-​final positions. The perform-
ance of T2 in word-​initial positions is superior to that in word-​final positions.
This is compatible with a prevailing prosodic pattern that cross-​linguistically
declarative sentences often begin with a low or rising tone and end with a
falling or low tone (Ladd 2008). This pitch pattern with a rising pitch contour
utterance-​initially and a fading or falling pitch contour utterance-​finally has
also been found in L2 Vietnamese tones (Nguyen and Macken 2009). In L2
Vietnamese tones made by English speaking learners, the “error probability
was low when the [rising] tone was preceded by a pause, which suggests that
participants were more likely to pronounce rising […] correctly when it was
not preceded by any tone” (Nguyen and Macken 2009: 70). The situation was
found to be reversed for the glottalized falling tone because, as the authors
state, “it was most likely to be mispronounced when preceded by a pause”.

13.5.2 Tonal coarticulation effects in L2 contour tones

This study also observed some intertonal effects. The adult learners’ Tone 2
(mid-rising tone) and, to some extent, their Tone 4 (high-falling tone) tend
to be less intelligible to native listeners when followed by tones starting with
a high onset (T1 or T4). By examining the accuracy rates, pitch values, and
error types of contour tones, the experiment provides evidence
that, while carry-over effects are minor, anticipatory coarticulation effects
can be seen in various aspects of non-​native contour tone productions, espe-
cially for the case of T2. Anticipatory tonal effects have been studied less than
carry-​over ones, likely because anticipatory effects are often inconspicuous
and difficult to detect. This section highlights three asymmetric features
concerning anticipatory coarticulation found in this study that I believe would
be of interest for future studies: an asymmetry in triggering environments, an
asymmetry in effect size between T2 and T4, and an asymmetry in effect size
between carry-​over coarticulation and anticipatory coarticulation.
This L2 Chinese tone study confirms claims made in studies of native
Chinese tones that only the H(igh) component of contour tones is sensitive
to anticipatory coarticulation (Gandour et al. 1994; Xu 1997). A diagram of
anticipatory dissimilation on T2s (cases (a) and (b)) and T4s (cases (c) and
(d)) is shown in Figure 13.4.

Figure 13.4 The effect of anticipatory dissimilation on T2 and T4 (panels: T2, T4)

In cases (a) and (c), the H component is pushed even higher due to the L
component tones in the following syllable. This is a “pre-​low raising” effect.
This kind of “enhancement” of H in target contour tones usually does not
influence the preservation of target tone identities. However, in cases (b) and
(d), the H component tones in T2 and T4 are lowered, triggered by the H
in the following syllable. This decreases the intelligibility of the L2 word-​
initial T2 and T4 tones for native Chinese listeners, resulting in more errors
in L2 tones.
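
The four cases can be restated as a small lookup: the H component of a word-initial contour tone moves away from the pitch of the following onset. The sketch below is only an illustration of that description. Writing T1 as HH and T3 as LL is a simplifying assumption (the text states only that T1 and T4 begin high and that T2 and T3 begin low), and the function name `anticipatory_effect` is hypothetical.

```python
# Tones as sequences of H(igh)/L(ow) components. T1 = HH and T3 = LL are
# simplifying assumptions made for this illustration.
TONES = {"T1": "HH", "T2": "LH", "T3": "LL", "T4": "HL"}

def anticipatory_effect(initial, following):
    """Predict how the H component of a word-initial contour tone is affected."""
    if initial not in ("T2", "T4"):
        raise ValueError("only the contour tones T2 and T4 are modelled")
    next_onset = TONES[following][0]
    # Dissimilation: H is raised before a low onset and lowered before a high onset.
    return "H raised" if next_onset == "L" else "H lowered"

for first in ("T2", "T4"):
    for second in ("T1", "T2", "T3", "T4"):
        print(f"{first}-{second}: {anticipatory_effect(first, second)}")
```

Under these assumptions the sequences with a lowered H are exactly those with T1 or T4 in second position, the contexts in which the text reports reduced intelligibility.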
If we consider all of the cases in Figure 13.4 to be dissimilatory in nature,
the T2 patterns (i.e., (a) and (b)) indicate an immediate dissimilation since the
changes occur on contiguous tones. However, the T4 patterns (i.e., (c) and
(d)) are examples of distance dissimilation since the onset of the final syllable
triggers the contour changes of the beginning H component of the initial
syllable but not the immediate L component. Interestingly, more evidence for
(a) and (b) was found than for (c) and (d) in the present L2 study. That is,
while anticipatory coarticulation affects both T2 and T4 to similar degrees
in native Chinese (Xu 1997), in L2 tones it seems to affect T2 more easily
than T4. This is surprising given that T2 and T4 are equally grammatical and
are freely distributed in native Standard Chinese. In addition, there is no significant
difference between the frequency of T2 and T4 in the basic vocabulary
of Standard Chinese (Shang 2000). Since L2 learners do not have any
preexisting knowledge of lexical tones in their L1s, T2 and T4 would be expected
to be equally difficult for them. The asymmetry of T2 and T4 found in the present study thus
invites an analysis from the perspective of universal phonetic or phonological
constraints, such as the Tonal Markedness Scale.
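
The distinction between immediate and distance dissimilation drawn above depends on where the H component sits relative to the following syllable. A minimal sketch, assuming the two-component representations T2 = LH and T4 = HL used in the text; the function name `dissimilation_type` is hypothetical.

```python
# Component sequences for the two contour tones, as given in the text.
CONTOUR_TONES = {"T2": ("L", "H"), "T4": ("H", "L")}

def dissimilation_type(tone):
    """Classify by how many components separate the tone's H from the next syllable's onset."""
    components = CONTOUR_TONES[tone]
    gap = len(components) - 1 - components.index("H")
    return "immediate" if gap == 0 else "distance"

for tone in ("T2", "T4"):
    print(tone, dissimilation_type(tone))
```

For T2 the affected H is the offset and is contiguous with the trigger; for T4 the affected H is the onset, and an L component intervenes.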
The poorer performance and later acquisition of rising tones (T2) than
falling tones (T4) by both L1 and L2 learners have been observed in previous
studies (Li and Thompson 1977; Zhu and Dodd 2000; H. Zhang 2010, 2013).

This study found a similar asymmetry, in that T2 was found to be more sus-
ceptible to dissimilation than T4 was. The greater effects of coarticulation in
T2 sequences than in T4 sequences may involve the realization of tone targets
and perceptual recoverability of tones. Generally speaking, the syllable rhyme
is more important for the realization of tone contrasts than the syllable onset
is, because it is the part of the syllable where the periodic sound contains
more harmonics, and therefore the location within the syllable where pitch
is better discriminated (J. Zhang 2004; Flemming 2011). Anticipatory effects
influence more tone offsets at word-​initial positions. It is always the offsets of
T2 that are altered due to anticipatory effects, but it is the onsets of T4 that
are altered. This may result in T2 at word-​initial positions being more difficult
to perceive by native listeners.
The discussion above regarding the realization of tone targets is also related
to another asymmetric feature found in this study: More anticipatory effects
are observed than carry-​over effects in L2 tones. In tone languages, phono-
logical tone spreading almost always spreads rightward (due to carry-​over
effects), whereas the leftward spread of tones (due to anticipatory effects)
is very rare (Hyman and Schuh 1974; Chen 2000; Hyman 2007; Flemming
2011). According to Flemming (2011):

The rightward bias arises because it is more important to realize tone
targets in the syllable rhyme than in the onset so the deviations from tone
targets that arise during transitions between tones are located during syllable
onsets rather than syllable rhymes as far as possible.
(Flemming 2011: 5)

However, in L2 tones the failure to realize target tone offsets of the first syl-
lable by L2 learners may lead to higher error rates, resulting in more notice-
able anticipatory effects compared to carry-​over effects.

13.6 Conclusion

This study surveys the error patterns of non-native T2 and T4 in disyllabic
words produced by 60 learners of Chinese with different L1 backgrounds. It is
found that T2 is produced with a higher rate of accuracy at word-​initial
positions than in word-​final positions, while T4 is produced with a higher rate
of accuracy at word-​final positions than in word-​initial positions. However,
some interesting intertonal effects that run contrary to this general finding
were also observed. T4 is produced noticeably more accurately when it
is followed by T3 than in other contexts. The accuracy rates of T2 when followed by T3 and when
followed by another T2 are always higher than when T2 is followed by T1
or T4. By further examining the phonetic variation of correct productions
of T2 and T4 at corresponding positions and the error types for incorrect
productions, this study argues that the error grammar of L2 contour tones is
constrained by cross-linguistically common tonal coarticulation mechanisms,
in particular, anticipatory coarticulation.
Research concerning anticipatory coarticulation is still undeveloped,
being less studied than carry-​over coarticulation in the linguistic literature.
This is in part because carry-​over effects are predominantly strong, whereas
anticipatory coarticulation is difficult to detect in native tonal productions.⁴
L2 research may provide an ideal testing ground for further explorations
of anticipatory coarticulation mechanisms. It is hoped that the discussion
developed here will motivate further work in this line to fully understand the
properties of Chinese contour tones.

Notes

1 This chapter is developed from a pilot study presented at the First
International Conference on Prosodic Studies: Challenges and Prospects (ICPS-1),
13–14 June 2015, at Tianjin Normal University, China.
2 The Tone Bearing Unit (TBU) in English and Korean is usually assumed to be a
syllable, but the TBU in Japanese is the mora (Gussenhoven 2004; Venditti 2005).
3 The underlying form of Tone 3 is under debate. Traditionally, Tone 3 is taken to be
underlyingly a low dipping tone (tone value [214]) that is only realized in isolated
syllables and in prosodic-​final positions. The other T3 allotone, a low-​level tone
(tone value [21] or [11]), has a much wider distribution. Mei (1977), Yip (1980,
2002), and other studies argue that the underlying form of Tone 3 is [21], and Zhang
(2014) proposes that [214] is the intonation form of Tone 3.
4 A striking exception is the substantial anticipatory coarticulation in Kinyarwanda
(Myers 2003).

References

Bao, Z.-M. (1999) “Tonal contour and register harmony in Chaozhou”, Linguistic
Inquiry, 30(3), pp. 485–493.
Bent, T. (2005) Perception and production of non-native prosodic categories.
Unpublished doctoral diss., Northwestern University.
Boersma, P., and Weenink, D. (2011) Praat: Doing phonetics by computer [Computer
program]. Version 5.2.17. Retrieved November 2011,
Broselow, E., Chen, S., and Wang, C. (1998) “The emergence of the unmarked in
second language phonology”, Studies in Second Language Acquisition, 20(2),
pp. 261–​280.
Broselow, E., Hurtig, R., and Ringen, C. (1987) “The perception of second lan-
guage prosody” in Ioup, G., and Weinberger, S. (eds.) Interlanguage phonology.
Cambridge, MA: Newbury House Publishers, pp. 350–​362.
Brunelle, M. (2003) Coarticulation effects in Northern Vietnamese Tones. MS thesis,
University of Ottawa.
Chang, C. B., and Yao, Y. (2016) “Toward an understanding of heritage prosody:
Acoustic and perceptual properties of tone produced by heritage, native, and second
language speakers of Mandarin”, Heritage Language Journal, 13(2), pp. 134–​160.

Chen, M. Y. (2000) Tone sandhi: Patterns across Chinese dialects. Cambridge: Cambridge
University Press.
Chomsky, N., and Halle, M. (1968) The sound pattern of English. New York:
Harper & Row.
Delattre, P. C. (1964) “Comparing the phonetic features of English, German, Spanish
and French”, International Review of Applied Linguistics, 2, pp. 71–​97.
Eckman, F. R. (1991) “The Structural Conformity Hypothesis and the acquisition
of consonant clusters in the interlanguage of ESL learners”, Studies in Second
Language Acquisition, 13, pp. 23–​41.
Eckman, F. R. (2004) “From phonemic differences to constraint rankings: Research
on second language phonology”, Studies in Second Language Acquisition, 26,
pp. 513–​549.
Eckman, F. R. (2008) “Typological markedness and second language phonology” in
Edwards, J. G. H., and Zampini, M. L. (eds.) Phonology and second language acqui-
sition, vol. 36. Amsterdam: John Benjamins, pp. 95–​115.
Flemming, E. (2008) “The role of pitch range in focus marking”. Slides from a talk
given at the Workshop on Information Structure and Prosody, Studiecentrum
Flemming, E. (2011) “The grammar of coarticulation” in Embarki, M., and Dodane,
C. (eds.) La Coarticulation: Indices, Direction et Representation. Paris: L’Harmattan.
Gandour, J., Potisuk, S., and Dechongkit, S. (1994) “Tonal coarticulation in Thai”,
Journal of Phonetics, 22, pp. 477–​492.
Gandour, J., Potisuk, S., Dechongkit, S., and Ponglorpisit, S. (1992) “Tonal coarticu-
lation in Thai disyllabic utterances: A preliminary study”, Linguistics of the Tibeto-​
Burman Area, 15, pp. 93–​110.
Goldsmith, J. (1976) “An overview of autosegmental phonology”, Linguistic Analysis,
2, pp. 23–​68.
Gussenhoven, C. (2004) The phonology of tone and intonation. Research Surveys in
Linguistics. Cambridge: Cambridge University Press.
Han, M. S., and Kim, K. (1974) “Phonetic variation of Vietnamese tones in disyllabic
utterances”, Journal of Phonetics, 2, pp. 223–​232.
HSK Department, Chinese Government. (1992) General outline of the Chinese vocabu-
lary levels and graded Chinese characters. Beijing: Beijing Language Institute
Hu, Z. L. (2008) “Zuowei waiyu de hanyu jiaoxue [Teaching Chinese as a foreign lan-
guage]”, Foreign Language Education in China, 1(2), pp. 3–​12.
Hyman, L. (2007) “Universals of tone rules: 30 years later” in Gussenhoven, D.,
and Riad, T. (eds.) Tones and tunes, vol. 1: Studies in Word and Sentence Prosody.
Berlin: Mouton de Gruyter.
Hyman, L., and Schuh, R. G. (1974) “Universals of tone rules: Evidence from West
Africa”, Linguistic Inquiry, 5, pp. 81–​115.
Hyman, L., and VanBik, K. (2004) “Directional rule application and output problems
in Hakha Lai tone”, Language and Linguistics, 5(4), pp. 821–​861.
Jun, S.-​A. (1996) The phonetics and phonology of Korean prosody. New York;
London: Garland Publishing.
Jun, S.-​A. (2005) Prosodic typology. Oxford: Oxford University Press, pp. 201–​230.
Kubozono, H. (2011) “Japanese pitch accent” in van Oostendorp, M., Ewen, C.
J., Hume, E., and Rice, K. (eds.) The Blackwell companion to phonology, vol.
5. Malden, MA, and Oxford: Wiley-​Blackwell, pp. 2879–​2907.

Ladd, D. R. (2008) Intonational phonology. 2nd edition. Cambridge: Cambridge
University Press.
Leben, W. (1973) Suprasegmental phonology. Ph.D. Diss., Massachusetts Institute of
Li, C., and Thompson, S. (1977) “The acquisition of tone in Mandarin-​speaking chil-
dren”, Journal of Child Language, 4, pp. 185–​199.
Major, R. (2001) Foreign accent: The ontogeny and phylogeny of second language phon-
ology. Mahwah, NJ: Lawrence Erlbaum Associates.
Major, R. (2008) “Transfer in second language phonology: A review” in Edwards,
J. G. H., and Zampini, M. L. (eds.) Phonology and second language acquisition.
Amsterdam: John Benjamins, pp. 63–​94.
McCarthy, J. (1986) “OCP effects: Gemination and antigemination”, Linguistic
Inquiry, 17, pp. 207–​264.
Mei, T.-​L. (1977) “Tones and tone sandhi in 16th century Mandarin”, Journal of
Chinese Linguistics, 5, pp. 237–​260.
Myers, S. (2003) “F0 timing in Kinyarwanda”, Phonetica, 60, pp. 71–​97.
Nguyen, H., and Macken, M. (2009) “Factors affecting the production of Vietnamese
tones”, Studies in Second Language Acquisition, 30, pp. 49–77.
Ohala, J. (1978) “Production of tone” in Fromkin, V. A. (ed.) Tone: A linguistic survey.
New York: Academic Press, pp. 3–​39.
Ohala, J. (1981) “The listener as a source of sound change” in Masek, C. S., Hendrick,
R. A., and Miller, M. F. (eds.) Papers from the parasession on language and behavior.
Chicago: Chicago Linguistic Society, pp. 178–203.
Pierrehumbert, J. B. (1980) The phonology and phonetics of English intonation. Ph.D.
Diss., Massachusetts Institute of Technology.
Pike, K. (1948) Tone languages. Ann Arbor: University of Michigan Press.
Potisuk, S., Gandour, J., and Harper, M. P. (1997) “Contextual variations in trisyllabic
sequences of Thai tones”, Phonetica, 54, pp. 22–​42.
Shang, X.-​Y. (2000) “Duiwai hanyu cihui dengji dagang jia yi ji ci shengdiao guilü
de diaocha [A survey of the tonal patterns and stress patterns of entries of levels
A and B in the outline of Chinese proficiency (Graded Vocabulary)]”, Chinese
Teaching in the World, 52, pp. 43–​47.
Shen, X. (1990) “Tonal coarticulation in Mandarin”, Journal of Phonetics, 18,
pp. 281–​295.
Sun, S. (1998) The development of a lexical tone phonology in American adult learners
of Standard Mandarin Chinese. Honolulu: University of Hawai’i Press.
Venditti, J. (2005) “The J_​ToBI model of Japanese intonation” in Jun, S.-​A. (ed.)
Prosodic typology: The phonology of intonation and phrasing. New York: Oxford
University Press, pp. 172–​200.
Wang, Y.-​J. (1995) “Ye tan meiguo ren xuexi hanyu shengdiao [On American learners’
tone acquisition]”, Language Teaching and Research, 3, pp. 126–​140.
Wang, Y.-​J. (1997) “Yangping de xietong fayin yu waiguoren xuexi yangping [The coar-
ticulation of Tone 2 and the acquisition of T2 by foreigners]”, Language Teaching
and Research, 4, pp. 94–​104.
Xu, Y. (1994) “Asymmetry in contextual tonal variation in Mandarin” in Chang, J.-​W.,
Huang, J.-​T., Hue, C.-​W., and Tzeng, O. J. L. (eds.) Advances in the study of Chinese
language processing, vol. 1. Taipei: Department of Psychology, National Taiwan
University, pp. 383–​396.
Xu, Y. (1997) “Contextual tonal variations in Mandarin”, Journal of Phonetics, 25,
pp. 61–​83.

Xu, Y. (2001) “Sources of tonal variations in connected speech”, Journal of Chinese
Linguistics (Monograph series), 17, pp. 1–​31.
Yang, B. (2015) Perception and production of Mandarin tones by native speakers and L2
learners. Berlin, Heidelberg: Springer.
Yip, M. (1980) The tonal phonology of Chinese. Ph.D. Diss., Massachusetts Institute
of Technology (Published 1990, New York: Garland Publishing).
Yip, M. (2002) Tone. Cambridge: Cambridge University Press.
Zhang, H. (2010) “Phonological universals and tone acquisition”, Journal of Chinese
Language Teachers Association, 45(1), pp. 39–​65.
Zhang, H. (2013) The second language acquisition of Mandarin Chinese tones by
English, Japanese and Korean speakers. Ph.D. Diss., University of North Carolina
at Chapel Hill.
Zhang, H. (2014) “The third tone: Allophones, sandhi rules and pedagogy”, Journal of
Chinese Language Teachers Association, 49(1), pp. 117–​145.
Zhang, H. (2015) “Positional effects in second language Chinese tones”, Journal of
Chinese Language Teaching, 12(2), pp. 1–​30.
Zhang, H. (2016) “Dissimilation in the second language acquisition of Chinese tones”,
Second Language Research, 32(3), pp. 427–​451.
Zhang, J. (2002) The effects of duration and sonority on contour tone distribution: A
typological survey and formal analysis. New York: Routledge.
Zhang, J. (2004) “The role of contrast-​specific and language-​specific phonetics in con-
tour tone distribution” in Hayes, B., Kirchner, R., and Steriade, D. (eds.) Phonetically
based phonology. Cambridge: Cambridge University Press, pp. 157–​190.
Zhu, H., and Dodd, B. (2000) “The phonological acquisition of Putonghua (Modern
Standard Chinese)”, Journal of Child Language, 27, pp. 3–​42.

Subject index

accent 3–4, 112–116, 131, 141, 145–146, c-command 282–284, 286, 288, 292,
155n2, 155n6, 163, 199, 241, 253–254, 304–305, 307
257, 264, 271, 272n2, 346–347, 351 chain shift 279
accented syllable 111, 120, 131, 133n1, Chao’s letter 3, 162–164, 179, 188
150 child language 317, 323, 331–343
Accentual Phrase 347 citation tone 80, 201, 205, 216, 278–279,
acoustic cue 143–144, 199, 219, 281, 293–296, 307, 333
319, 328 clitic 2, 10, 15–16, 18, 21, 26, 28, 31–32,
acquisition 1, 4, 317–320, 329, 345, 347, 34–35, 38, 41–45, 47–52, 55n23,
350–351, 354, 363; acquisition of 80–89, 91–95, 97–99, 101–105, 228,
lexical tones 320 231–232, 234, 236–237, 239–240, 246,
Adjunction Approach 10, 12, 29, 32–33, 272n8
36, 51–52, 55 clitic group 2, 10, 14, 30, 53n4, 54n8, 80,
Advanced Tongue Root (ATR) 232 82–86, 94–95, 98–99, 101–105, 227,
affix 21, 25–26, 31, 33–36, 39–41, 44–47, 310n7
49–52, 54n17, 55n21, 82–83, 94, coarticulation 143, 205, 216, 236,
103, 105n12, 227, 234, 239–240, 247, 238, 247, 248n6, 345–346, 348–350,
248n6, 307 356–359, 361–365
Alignment Theory 12, 276, 279, 307 coda 20, 44, 54n13, 61–63, 65–66,
Align-Wrap Theory 278, 309 75, 77, 87, 116, 120, 153, 200, 229,
allotone 3, 159–160, 167, 169–170, 310n8, 352
172–175, 177–181, 189–190, 365 Composite Group 2, 10–11, 18–19, 28,
anticipatory coarticulation 143, 30, 34, 36, 39–43, 46, 49–53, 55n26,
345–346, 348–350, 357–359, 55n28
361–363, 365 Composite Prosody Model 10, 29, 34,
anticipatory effect 205, 207, 349, 358 36–37, 39, 43, 47–53, 55n27, 55n31
assimilation 26–27, 35–36, 41, 45–47, 86, compound 21–22, 26, 31, 34–35, 42–43,
227–230, 234–235, 246, 319, 348 45–46, 52, 54n18, 55n28, 155n4, 200–
Autosegmental Phonology 162 202, 205–207, 209, 213–220, 232–233,
base tone 200, 205–206, 216, 309n2 Compound Stress Rule 22, 26, 45–46
binarity constraints 262, 264–265, 269, consonant harmony 227–228, 231, 238,
271 240, 246–247, 248n8
binary 19–20, 74, 78, 162, 243, 253, constraint ranking 32, 55n27, 269–270,
255, 262–265, 278, 301–304, 308–309; 306
binary branching 20, 253, 264 contour tone 142, 152, 162, 322–323,
body 61, 63 328, 345–351, 357–359, 361–365
branching 20, 36, 253, 259, 262–264, contrastive focus 198, 201–202, 206, 215,
283, 298 217–220

372 Subject index

Direct Reference Approach 306 intertonal effects 4, 345–346, 350, 353,
Direct Reference Theory 309 356, 362, 364
dissimilation 4, 244, 348, 363–364 Intervocalic s- Voicing (ISV) 16
Distributional Typology approach 12, 28 interword phonology 228, 241
domain c-command 284 intonation 111, 128, 130, 133, 143,
domain juncture 13, 24 150, 198, 202, 228, 241–242, 347,
domain limit 13, 24 352, 356, 365n3; intonation contour
domain span 13, 24 111, 130
downstep 142, 163, 241 intonational phrase 11–12, 14, 18, 22, 28,
30–31, 50–51, 84
edge c-command principle 304–305 Isomorphism 252–253, 271
Edge-based theory 253, 282
empty category 281–282, 292 K-condition 284, 286
enclitic 2, 80, 82, 86–89, 91–95, 98–99,
101–105 language acquisition 1, 4, 317, 350
end-based approach 282 the Law of Finals 2, 63, 70, 74, 77
EqualSisters constraint 261 the Law of Initials 2, 63, 74, 77
left-branching structure 259, 262–263,
F0 (pitch) development 4, 331–332 298
F0 analysis 334 left-dominant sandhi (LDS) 200–202,
faithfulness 237, 276 205, 215–216
features 52, 129, 143, 152–153, 155n8, Lexical Category Condition 280
162, 198, 200–201, 231, 233, 237–238, lexical phonology 9, 230, 237, 240
240–241, 323, 328, 347, 350, 362 lexical word 31, 42, 45, 141, 240, 276
flapping 229 liaison 10, 229–230, 248n7
foot 2, 11, 14, 16–17, 19–20, 23, 25, 27, locality 288, 292
38–40, 49–50, 55n29, 67–78, 84–85,
142, 144, 146–147, 155n7, 228–229, mapping rules 31, 40, 48
231, 238–240, 244, 247, 306 markedness 276, 280, 346, 350–351, 353,
frequency (of collocation) 228, 232–236, 362–363
240, 246–247 Match Theory 3, 9–12, 28–33, 36, 39, 47,
function word 15–16, 21, 31–34, 39, 41, 51–52, 53n3, 54n6, 55n21, 252, 271,
43–45, 47, 49, 52, 55n29, 227–228, 276–278, 281, 307, 309
231–232, 234, 236–237, 239, 246–247, max onset 2, 61–72, 74–75, 77
352 Maximal Parsing principle 38
functional category 280, 286, 288 m-command 283–286, 289–292,
functional relation 282, 284, 286, 293, 307, 309
296–297, 304–305 meter 3
metrical phonology 67
Generalized Alignment Theory 276 minimal pair 35, 141, 148, 320, 347
geographical cline 2, 111–112, 128, 133 minimal word 71
glottalization 160–161, 165–167, 169, mora 16, 20, 53n4, 54n16, 54n20, 68,
172–180, 189–193 70–72, 84, 144, 146–147, 149, 153,
grammaticization 234 155n5, 163, 365n2
morpheme 33–35, 46, 87, 141, 149, 227–
Indirect Reference Approach 306 228, 237, 240, 244, 307, 323, 351–352
Indirect Reference Theory 309 morphosyntactic structure 3, 9, 19, 28,
infant speech perception 317, 320, 47, 53, 95, 200, 215–216, 220, 276
Initial Consonant Lenition 81 neutral tone 95, 144, 149, 155n5, 309n5,
interaction 18, 46, 83, 119–124, 126, 128, 352
171, 212, 296, 326–327 NoLapse constraint 259
Interrogative particles 83, 91, 105n14 nucleus 61, 153

Subject index 373

onset 2–4, 17, 20, 35, 38, 61–72, 74–75, 77, 112, 116, 159–160, 164, 167, 169, 172–179, 183, 187–193, 202–204, 207, 229, 246, 310n9, 322–323, 328, 331–332, 336, 341–343, 348–350, 357–359, 361–364
Optimality Theory 3, 69, 162, 275, 277

palatalization 86
parameter 2, 151, 153–154, 181–182, 282
pause 72–73, 75, 83, 133n2, 236, 352, 362
peak alignment 112
Penultimate Vowel Lengthening 23, 31, 48
perception 1, 143, 147, 160–161, 169, 171, 175, 179, 203, 309, 317–318, 320–323, 326, 328, 346
phoneme 67, 69, 234
phonetic categories 171, 317, 319–320, 327–328
phonetic development 317, 327–328
phonetic precursor (of phonological process) 228, 230, 236, 243, 246
phonological domains 34, 53, 252, 277
phonological phrase 2, 10–12, 14–15, 24–25, 28, 30, 34, 37, 43, 48–53, 84, 233, 244, 247, 252–255, 265–267, 276–282, 301–302, 304, 307–308, 347, 356
phonological utterance 11, 54n6, 55n26
phonological word 2, 10–11, 14–18, 21–22, 25, 27–28, 30, 32, 34, 36, 39–43, 49–52, 56, 84–85, 150, 155n10, 254, 301
phonologization 228, 237–238, 240, 246–247, 248n6
phonotactic constraint 35
phrasal process 233
pitch accent 112–113, 116, 128, 131, 141, 146, 152–153, 155n2, 163, 199, 219, 241, 257, 271, 272n2, 346–347, 351
postlexical process 227–231, 233, 238, 243
postlexical rule 227–228, 230–231, 241, 244
post-verbal particles 92
prefix 21, 26, 35, 41, 43–44, 48, 233, 241, 244
Principle of Constituent Sequencing 37
Principle of Containment 37
Principle of Maximal Parsing 38
Principle of Minimal Distance 38, 42, 55n27
Principle of the Morphological Core 39
Principle of the Morphological Maximum 42
Principle of Proper Headedness 37
Prosodic Essence Conjecture (PEC) 141, 151, 153–155
prosodic hierarchy 1–2, 4, 9–12, 15–20, 23, 27–29, 31–34, 36–39, 43, 47–53, 53n2, 54n8, 55n25, 84–86, 104, 213, 217, 252–253, 277, 301, 304, 308
prosodic patterns 1–2, 4, 151
Prosodic Phonology 1, 9–11, 27–29, 34, 49, 51, 53n2, 54n5, 82–83, 102–103, 105n8, 275
prosodic structure 2, 9, 13, 16, 19, 23–24, 29–31, 33–34, 37, 40, 45–47, 49, 51–53, 84, 105n7, 149, 199–201, 204, 209, 213, 215–220, 228, 253, 256, 262, 276, 278, 281–282, 304, 307–308, 346, 351
prosodic unit 1, 85, 87, 94, 103, 213, 218, 271, 275, 277, 304, 307–308, 345, 348, 356
prosodic word 2, 40, 55n28, 84, 86–87, 102–103, 105n10, 142, 144, 149, 201, 207, 209, 214, 216–219, 227–228, 233, 276–277, 301–302, 304, 308, 356
Prosodic Word Adjoiners 40
prosody i, 1–4, 10, 29, 34, 36–37, 39, 43, 47–53, 55n27, 55n31, 143, 146, 150–151, 153–155, 163, 198, 200, 203, 252–253, 265–266, 268, 270–271, 309, 346–347

recursion 3, 10, 12–13, 17, 19–21, 23–25, 29–32, 34, 37–39, 51–53, 54n12, 54n15, 54n16, 252
restructuring 30, 32, 301, 304, 308
resyllabification 227, 229–230
revised max onset 2, 61, 65–66, 72, 77
rhyme 211–214, 216–217, 364
rhythm 84, 230, 239–240
right-branching structure 259
right-dominant sandhi (RDS) 200–202, 205–206, 215–216, 220
rime 61–63, 65–66, 113, 116–119, 128–131

sandhi tone 80, 294–295, 309
Second Language Acquisition 1, 4
Sonority Sequencing Principle 17, 19, 25, 38
Sound Pattern of English 9, 253
s-structure 292
stress 2–3, 16–17, 20, 22, 25–26, 40–41, 43–46, 54n4, 54n18, 55n22, 55n23, 61–62, 65–78, 83, 85–86, 130, 141–155, 198, 227–229, 231, 238–240, 244, 246–247, 318, 343, 346–347
stress assignment 20, 25–26, 40, 46, 54n4, 55n23, 67, 69, 73–75, 77, 85–86, 145, 153, 227–229, 231, 239–240, 246–247
Strict Layer Hypothesis 2, 9–10, 12, 82, 277
suffix 34–35, 40–46, 48, 55n24, 86, 142, 144, 233, 239
syllabification 2, 54n13, 61, 63–64, 66, 69, 71–77, 227, 229–230
syllable 3, 11, 14–17, 19–21, 26–27, 32, 35, 38–41, 44, 50, 54n15, 61–75, 77–78, 80–81, 84–87, 93, 95, 98, 101, 103, 111–112, 116–117, 120–121, 130–131, 133n1, 141–145, 148–154, 155n7, 163, 165, 171–172, 200, 202, 204–207, 209–219, 227, 229–231, 233–247, 248n6, 279, 294, 299, 301, 304, 308, 309n3, 310n8, 319–320, 323, 334, 347, 349–350, 359, 363–364, 365n2
syllable weight 20, 61, 66, 71, 142
syntactic doubling 230
syntax-prosody mapping 252, 265–266, 271

tonal coarticulation 205, 216, 345, 362, 365
Tonal Markedness Scale (TMS) 346, 350–351, 353, 362–363
tonal phonology 242, 362
tone bearing unit 149, 241, 243, 320, 347, 365n2
Tone and Break Indices (ToBI) Approach 12
tone contour 205, 215
tone perception 160
tone sandhi 3, 80, 104n2–4, 200, 205–206, 215, 227, 242–244, 246–247, 275, 279, 295, 309, 352
tone spread 18, 227, 231, 242–244
trochee 68, 71
typology 3, 12, 28, 152–155, 227, 236, 269, 309

underlying pitch targets 3, 163–164, 180–181
unification 152, 155
upstep 163
utterance 11, 54n6, 55n26, 83–84, 94, 103, 198, 242, 309, 362

voicing 16, 21, 26–27, 34–36, 39, 43–46, 86, 143, 167, 173, 179, 200, 213, 227, 229–230, 238; voicing assimilation rule 26
vowel deletion 227, 230
vowel harmony 26, 43, 46–48, 86, 227–228, 231–234, 236–240, 242–243, 246–247, 248n6, 248n8

Weight-Stress Principle 2, 72, 74, 77
well-formedness condition 71
word stress 41, 66–67, 69–70, 72, 78
Wrap Theory 12, 29, 307
Wrap-Align Theory 281–282, 307, 309
Wrapping Theory 276, 279
