Papers by Steven Greenberg
Hearing Research, 1994
The quasiperiodicity in the acoustic waveform of speech and music is a pervasive feature of our acoustic environment. The use of 200% amplitude modulated (AM) signals allows the study of rate and temporal envelope coding using three equal-amplitude components, a situation that is frequently approximated in natural vocalizations. The recordings reported here were made in the ventral cochlear nucleus of the cat, a site of auditory signal feature enhancement and the origin of several ascending auditory pathways. The discharge rate vs. modulation frequency relation was nearly always all-pass in shape for all unit types, indicating that discharge rate is not a code for modulation frequency. Onset cells, especially onset-choppers and onset-I units, exhibited remarkable phase locking to the signal envelope, nearly to the exclusion of phase locking to the AM components. They exhibited lowpass temporal modulation transfer functions (tMTFs) that occasionally had corner frequencies greater than 1 kHz. Primary-like, primary-like with notch, and onset-L units all exhibited considerable variability in their coding properties, with tMTFs that varied from lowpass to bandpass in shape. The bandpass shape became more frequent with increasing stimulus levels. A common feature of cochlear nucleus units was less sensitivity to the level of the AM stimulus than is present in the auditory nerve. Phase locking to the envelope persisted over a wider range of stimulus levels than rate changes in a subset of the units studied. The tMTFs for a 100% sinusoidally modulated, spectrally flat noise were similar in amplitude and bandwidth to those obtained for AM stimuli. The tMTF was relatively insensitive to carrier frequencies different from the unit characteristic frequency. AM synchrony vs. level curves exhibited systematic shifts that equaled or exceeded dynamic rate shifts that occur with increasing levels of a noise masker. Phase locking to the envelope was robust under a wide variety of signal conditions in all unit types. The ordering of response types based on the maximum of the tMTF is onset-I = onset-chop > choppers = primary-like with notch = onset-L > primary-like.
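For readers less familiar with the synchrony measures referenced above: phase locking to an AM envelope is commonly quantified with vector strength, and a temporal modulation transfer function (tMTF) is then that synchrony plotted against modulation frequency. The sketch below is a generic illustration of that computation, not the recording or analysis pipeline of the study itself; the spike times and modulation frequency are hypothetical.

```python
import numpy as np

def vector_strength(spike_times, mod_freq):
    """Vector strength of spike times (seconds) relative to a modulation
    frequency (Hz): 1.0 = perfect phase locking to the envelope, 0.0 = none."""
    phases = 2.0 * np.pi * mod_freq * np.asarray(spike_times)
    return np.abs(np.mean(np.exp(1j * phases)))

def temporal_mtf(spike_trains, mod_freqs):
    """Temporal modulation transfer function: envelope synchrony at each
    modulation frequency, one spike train per AM stimulus condition."""
    return np.array([vector_strength(st, fm)
                     for st, fm in zip(spike_trains, mod_freqs)])

# Hypothetical example: spikes tightly locked to a 200-Hz envelope.
rng = np.random.default_rng(0)
fm = 200.0
spikes = rng.integers(0, 200, 400) / fm + rng.normal(0.0, 0.3e-3, 400)
print(vector_strength(spikes, fm))  # roughly 0.9 for this amount of jitter
```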
Auditory theory has traditionally pitted "place" (the tonotopically organized spatial pattern of excitation) against "time" (the temporal pattern of discharge) with respect to the neural representation underlying specific attributes of acoustic sensation. This long-standing controversy has been of particular significance for models of pitch and frequency analysis, but casts its theoretical shadow over the discipline as a whole. A potential resolution of this historical opposition is proposed, in which place and time are viewed as flip sides of a complex representational matrix of neural activity, bound together through the mechanics of the cochlear traveling wave and its interaction with central loci of coincidence detection and inhibition. Frequency analysis is viewed as possessing two operational components. One is excitatory, based on spatially circumscribed patterns of temporally coherent peripheral activity and processed by central coincidence-sensitive neural elements. The other involves central inhibitory elements driven by non-synchronous activity distributed over a broad tonotopic domain. Together, these two mechanisms can account for the preservation of frequency selectivity across a wide range of frequencies and sound pressure levels, despite dramatic changes in the average-rate-based profile of neural activity. The traveling wave is also of importance in formatting the peripheral spatio-temporal response pattern germane to periodicity analysis and the perception of pitch. The present framework resolves the long-standing schism between spectral and temporal theories by virtue of a formulation in which pitch is viewed as resulting from the interplay of place and temporal information, bound together into a coherent representation through the operation of central coincidence-sensitive neural populations. Within this perspective, both frequency resolution and neural synchrony are required for a robust sensation of pitch to occur.
Because the nature of prosodic stress is incompletely understood, implementing an automatic transcriber on the basis of currently available knowledge is very difficult. For this reason, a number of data-driven approaches are applied to a manually annotated set of files from the OGI English Stories Corpus. The goal of this analysis is twofold. First, it aims to implement an automatic detector of prosodic stress with sufficiently reliable performance. Second, the effectiveness of the acoustic features most commonly proposed in the literature is assessed; that is, the role played by the duration, amplitude and fundamental frequency of syllabic nuclei is investigated. Several data-driven algorithms, including artificial neural networks (ANN), statistical decision trees and fuzzy classification techniques, as well as a knowledge-based heuristic algorithm, are implemented for the automatic transcription of prosodic stress. As reference, two different subsets from the OGI English Stories database…
In a previous study (14) we had concluded that amplitude and duration are the most important acoustic parameters underlying the patterning of prosodic stress in casually spoken American English, and that fundamental frequency (f0) plays only a minor role in the assignment of stress. The current study re-examines this conclusion (using both the range and average level of…
Phonetica, 2009
This study was motivated by the prospective role played by brain rhythms in speech perception. The intelligibility (in terms of word error rate) of natural-sounding, synthetically generated sentences was measured using a paradigm that alters speech-energy rhythm over a range of frequencies. The material comprised 96 semantically unpredictable sentences, each approximately 2 s long (6–8 words per sentence), generated by a high-quality text-to-speech (TTS) synthesis engine. The TTS waveform was time-compressed by a factor of 3, creating a signal with a syllable rhythm three times faster than the original and whose intelligibility is poor (<50% words correct). A waveform with an artificial rhythm was produced by automatically segmenting the time-compressed waveform into consecutive 40-ms fragments, each followed by a silent interval. The parameters varied were the length of the silent interval (0–160 ms) and whether the lengths of silence were equal (‘periodic’) or not (‘aperiodic’)…
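The rhythm-alteration paradigm described above (time compression followed by segmentation into 40-ms fragments separated by silent intervals) can be reproduced in outline. The sketch below shows only the fragment-plus-silence step applied to an already time-compressed waveform; the function and parameter names (insert_silences, gap_ms, jitter_ms) are illustrative, and the distribution used for the ‘aperiodic’ condition is an assumption rather than the study's actual procedure.

```python
import numpy as np

def insert_silences(x, fs, frag_ms=40.0, gap_ms=80.0, periodic=True,
                    jitter_ms=40.0, seed=0):
    """Cut a (time-compressed) waveform into consecutive fragments and insert
    a silent interval after each one; equal gaps give a 'periodic' rhythm,
    randomly varied gaps an 'aperiodic' one."""
    rng = np.random.default_rng(seed)
    frag = int(round(frag_ms * 1e-3 * fs))
    out = []
    for start in range(0, len(x), frag):
        out.append(x[start:start + frag])
        gap = gap_ms if periodic else max(0.0, gap_ms + rng.uniform(-jitter_ms, jitter_ms))
        out.append(np.zeros(int(round(gap * 1e-3 * fs)), dtype=x.dtype))
    return np.concatenate(out)

# Usage sketch (time compression itself, e.g. by a phase vocoder, is assumed
# to have produced `x_fast` at sampling rate `fs`):
# y_periodic  = insert_silences(x_fast, fs, gap_ms=80.0, periodic=True)
# y_aperiodic = insert_silences(x_fast, fs, gap_ms=80.0, periodic=False)
```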
8th European Conference on Speech Communication and Technology (Eurospeech 2003)
Temporal dynamics provide a fruitful framework with which to examine the relation between information and spoken language. This paper serves as an introduction to the special Eurospeech session on "Time is of the Essence – Dynamic Approaches to Spoken Language," providing historical and conceptual background germane to timing, as well as a discussion of its scientific and technological prospects. Dynamics is examined from the perspectives of perception, production, neurology, synthesis, recognition and coding, in an effort to define a prospective course for speech technology and research. Speech is inherently dynamic, reflecting the motion of the tongue and other articulators during the course of vocal production. Such articulatory dynamics are reflected in rapid spectral changes, known as formant transitions, characteristic of the acoustic signal. Although such dynamic properties have long been of interest to speech scientists, their fundamental importance for spoken language has only recently received broad recognition. The special Eurospeech session on "Time is of the Essence – Dynamic Approaches to Spoken Language" is designed to acquaint the speech community with current research representative of this new emphasis on dynamics from a broad range of scientific and technical perspectives. The current paper serves as a brief introduction, providing historical and conceptual background for the session as a whole. Traditionally, articulatory mechanisms have been examined principally from a biomechanical perspective. Given the structural constraints imposed through phylogenetic descent, speech production has generally been viewed as nature's way of solving an exceedingly complicated problem with limited biomechanical means. The jaw, tongue, lips and other articulators can move only so fast, their rates of motion limited by their anatomical and physiological characteristics. Such properties reflect an evolutionary process long antedating the origins of human vocal communication. From this purely articulatory perspective, speech's spectro-temporal properties are primarily the consequence of biomechanical constraints imposed through the course of human (and mammalian) evolution. If the fine details of spoken language are governed by vocal production, how does the brain decode speech given the acoustic nature of the input to the auditory system? One prominent model, known as "Motor Theory," posits that the brain back-computes the articulatory gestures directly from the acoustic signal [17]. In essence, this framework likens the auditory system to a delivery service that transmits packages containing articulatory gestures decoded at some higher level of the brain. The process of perceiving (and ultimately understanding) speech thus reduces to…
Journal of Phonetics, 1988
Preface "Representation of Speech in the Auditory Periphery" A theme issue devoted to the auditor... more Preface "Representation of Speech in the Auditory Periphery" A theme issue devoted to the auditory. representation of speech may appear, at first glance, to be misplaced in the Journal of Phonetics. Auditory physiology and psychoacoustics have traditionally lain outside the domain of acoustic and articulatory phonetics, disciplines whose empirical and theoretical foundations are firmly grounded in physical acoustics and the physiology of the vocal apparatus. What might phoneticians gain from a deeper understanding of the auditory system? At the beginning of his career, Georg von Bekesy was a communications engineer working on ways to improve telephone transmission in his native Hungary. His distinguished research on cochlear mechanics began as a consequence of his conviction that an understanding of auditory function would be useful for designing better telecommunications equipment. Although telephone design is no longer a driving force in auditory research, von Bekesky's approach is probably as valid today as it was in the years just following the First World War. Nowadays, hearing mechanisms are studied by a diverse group of scientists, including otologists, physiologists, biologists, psychologists, computer scientists, physicists and electrical engineers, whose interests range from machine recognition of speech to cochlear prostheses. Thus, there is reason to believe that phoneticians may also benefit from a broader knowledge of auditory mechanisms. In particular, it would seem that auditory physiology may provide some insight into those factors constraining the temporal and spectral structure of speech sounds that can not be adequately explained on the basis of articulatory physiology. In editing this issue of the Journal of Phonetics I have been aided by a number of individuals who have given generously of their time and knowledge. John Ohala originally broached the idea of an issue devoted to auditory processing of speech and provided many useful suggestions for its design and implementation. Bjorn Lindblom also helped with the original organization and provided encouragement throughout the project. Marcel van den Broecke has helped steer this issue's development from its inception through to publication, providing assistance at many crucial points. I am also grateful to Julie Gorman of Academic Press who answered a number of key questions and who has overseen the production of the finished volume. Thanks are also owed to several of my colleagues in the Department of Neurophysiology, University of Wisconsin-Bill Chiu, Bill Rhode and Bob Wickesberg-who helped out in one manner or another at a crucial time. Finally, I would like to express my deepest appreciation and gratitude to the authors of the articles which appear in this issue and to the referees who reviewed their contributions for the Journal of Phonetics.
Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
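A minimal sketch of the kind of processing chain implied by the modulation spectrogram (band-pass channels, envelope extraction, retention of only the low-frequency modulation energy) is given below. The filter bank, window lengths and 0–16 Hz modulation range are assumptions for illustration; the published representation's exact parameters and normalization are not reproduced here.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_spectrogram(x, fs, band_edges, mod_lo=0.0, mod_hi=16.0,
                           win_s=0.25, hop_s=0.1):
    """Crude modulation-spectrogram sketch: band-pass the signal into
    critical-band-like channels, extract each channel's envelope, then keep
    only the low-frequency portion of the envelope's short-term spectrum."""
    win = int(win_s * fs)
    hop = int(hop_s * fs)
    frames = []
    for lo, hi in band_edges:
        sos = butter(4, [lo, hi], btype='band', fs=fs, output='sos')
        env = np.abs(hilbert(sosfiltfilt(sos, x)))        # channel envelope
        ch = []
        for start in range(0, len(env) - win, hop):
            seg = env[start:start + win] * np.hanning(win)
            spec = np.abs(np.fft.rfft(seg))
            f = np.fft.rfftfreq(win, 1.0 / fs)
            ch.append(spec[(f >= mod_lo) & (f <= mod_hi)].sum())  # low-freq modulation energy
        frames.append(ch)
    return np.array(frames)   # (channels, time) modulation-energy map

# band_edges is a list of (low, high) Hz pairs approximating critical bands,
# e.g. [(100, 200), (200, 300), ...]; the choice here is illustrative only.
```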
Seminars in Hearing, 2004
Throughout his career, Ira Hirsh studied and published articles and books pertaining to many aspects of the auditory system. These included sound conduction in the ear, cochlear mechanics, masking, auditory localization, psychoacoustic behavior in animals, speech perception, medical and audiological applications, coupling between psychophysics and physiology, and ecological acoustics. However, it is Hirsh's work on auditory timing of simple and complex rhythmic patterns, the backbone of speech and music, that is at the heart of…
Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '98), 1998
The Journal of the Acoustical Society of America, 1999
Classical models of speech recognition (both human and machine) assume that a detailed, short-term analysis of the signal is essential for accurate decoding of spoken language via a linear sequence of phonetic segments. This classical framework is incommensurate with quantitative acoustic/phonetic analyses of spontaneous discourse (e.g., the Switchboard corpus for American English). Such analyses indicate that the syllable, rather than the phone, is likely to serve as the representational interface between sound and meaning, providing a relatively stable representation of lexically relevant information across a wide range of speaking and acoustic conditions. The auditory basis of this syllabic representation appears to be derived from the low-frequency (2–16 Hz) modulation spectrum, whose temporal properties correspond closely to the distribution of syllabic durations observed in spontaneous speech. Perceptual experiments confirm the importance of the modulation spectrum for understanding…
The Role of Temporal Dynamics in Understanding Spoken Language. In: P. Divenyi et al. (Eds.), Dynamics of Speech Production and Perception, IOS Press, 2006.
Interspeech 2005, 2005
We present a new feature representation for speech recognition based on both amplitude modulation spectra (AMS) and frequency modulation spectra (FMS). A comprehensive modulation spectral (CMS) approach is defined and analyzed based on a modulation model of the band-pass signal. The speech signal is processed first by a bank of specially designed auditory band-pass filters. CMS are extracted from the output of the filters as the features for automatic speech recognition (ASR). A significant improvement is demonstrated in performance on noisy speech. On the Aurora 2 task the new features result in an improvement of 23.43% relative to traditional mel-cepstrum front-end features using a 3 GMM HMM back-end. Although the improvements are relatively modest, the novelty of the method and its potential for performance enhancement warrant serious attention for future-generation ASR applications.
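For a concrete picture of what the AM and FM components of a band-pass signal are, the following sketch demodulates one filter-bank channel via the analytic signal. It is a generic illustration of AM/FM decomposition, not the paper's CMS front end; the auditory filter design, feature windowing and normalization described above are not reproduced.

```python
import numpy as np
from scipy.signal import hilbert

def am_fm_demodulate(band_signal, fs):
    """Decompose one band-pass filter output into its amplitude envelope (AM)
    and instantaneous frequency (FM) via the analytic signal."""
    analytic = hilbert(band_signal)
    am = np.abs(analytic)                         # amplitude envelope
    phase = np.unwrap(np.angle(analytic))
    fm = np.diff(phase) * fs / (2.0 * np.pi)      # instantaneous frequency (Hz)
    return am, fm

# Modulation spectra could then be taken as low-order DFT/DCT coefficients of
# am and fm over short windows; that feature step is omitted in this sketch.
```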
The role of the duration, amplitude and fundamental frequency of syllabic vocalic nuclei is investigated for marking prosodic stress in spontaneous American English discourse. Local maxima of different evidence variables, implemented as combinations of the three basic parameters (duration, amplitude and pitch), are assumed to be related to prosodic stress. As reference, two different subsets from the OGI English Stories database were manually marked in terms of prosodic stress by two different trained linguists. The ROC curves, built on the training examples, show that both transcribers grant a major role to the amplitude and duration rather than to the pitch of the vocalic nuclei. More complex evidence variables, involving a product of the three basic parameters, allow around 80% of primary stressed and 77% of unstressed syllables to be correctly recognized in the test files of both transcribers' datasets. The agreement between the two transcribers on a set of common files supplies only sl…
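A hedged sketch of what an "evidence variable" built from duration, amplitude and pitch might look like, together with the threshold sweep that yields ROC points, is given below. The normalization (division by the per-utterance mean) and the product form are assumptions made for illustration, not the exact variables evaluated in the study.

```python
import numpy as np

def stress_evidence(duration, amplitude, f0):
    """Hypothetical evidence variable for prosodic stress: the product of the
    duration, amplitude and f0 of each vocalic nucleus, each normalized by its
    mean over the utterance."""
    def norm(v):
        v = np.asarray(v, float)
        return v / (v.mean() + 1e-12)
    return norm(duration) * norm(amplitude) * norm(f0)

def roc_points(evidence, is_stressed, n_thresholds=50):
    """Sweep a threshold over the evidence variable and return (false-alarm
    rate, hit rate) pairs, i.e. points along an ROC curve."""
    ev = np.asarray(evidence, float)
    y = np.asarray(is_stressed, bool)
    points = []
    for thr in np.linspace(ev.min(), ev.max(), n_thresholds):
        pred = ev >= thr
        hit = pred[y].mean() if y.any() else 0.0
        false_alarm = pred[~y].mean() if (~y).any() else 0.0
        points.append((false_alarm, hit))
    return points
```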
The displacement pattern of basilar membrane motion is tonotopically organized, with high frequencies reaching their apogee towards the base of the cochlea and low frequencies achieving their maximum near the apex (von Békésy, 1960). This systematic relationship between peak displacement and cochlear location serves as the linchpin of the "place" model of spectral representation and of auditory theory in general. In recent years this "classic" place model has come under increasing scrutiny in light of experimental observations demonstrating that this spatial organization of excitatory activity is generally discernible only under a restricted set of conditions in the auditory periphery, thus calling into question its ability to subserve frequency coding at sound pressure levels typical of speech communication and musical performance. In place of classic tonotopy, many recent models of pitch and frequency analysis focus on the temporal properties of peripheral activity…
The Journal of the Acoustical Society of America, 1998
A detailed auditory analysis of the short-term acoustic spectrum is generally considered essential for understanding spoken language. This assumption is called into question by the results of an experiment in which the spectrum of spoken sentences (from the TIMIT corpus) was partitioned into quarter-octave channels and the onset of each channel shifted in time relative to the others so as to desynchronize spectral information across the frequency plane. Intelligibility of sentential material (as measured in terms of word accuracy) is unaffected by a (maximum) onset jitter of 80 ms or less and remains high (>75%) even for jitter intervals of 140 ms. Only when the jitter imposed across channels exceeds 220 ms does intelligibility fall below 50%. These results imply that the cues required to understand spoken language are not optimally specified in the short-term spectral domain, but may rather be based on some other set of representational cues such as the modulation spectrogram [S…
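The cross-channel desynchronization manipulation described above can be sketched as follows: split the sentence into quarter-octave bands, delay each band by an independent random amount up to the maximum jitter, and recombine. The band edges, filter order and uniform jitter distribution below are illustrative assumptions; the original experiment's channelizer and jitter schedule are not reproduced.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def desynchronize(x, fs, f_lo=100.0, f_hi=6000.0, max_jitter_ms=80.0, seed=0):
    """Split a sentence into quarter-octave band-pass channels, delay each
    channel by an independent random amount (0..max_jitter), and sum the
    channels back together, desynchronizing information across frequency."""
    rng = np.random.default_rng(seed)
    max_shift = int(max_jitter_ms * 1e-3 * fs)
    edges = []
    f = f_lo
    while f * 2 ** 0.25 <= f_hi:
        edges.append((f, f * 2 ** 0.25))   # quarter-octave bands
        f *= 2 ** 0.25
    y = np.zeros(len(x) + max_shift)
    for lo, hi in edges:
        sos = butter(4, [lo, hi], btype='band', fs=fs, output='sos')
        band = sosfiltfilt(sos, x)
        shift = rng.integers(0, max_shift + 1)   # independent onset delay per channel
        y[shift:shift + len(x)] += band
    return y
```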