among other things, detects the turbulence of the fricative [s] and the fact that it is followed by a stop. It is
quite likely that initially it will be impossible to determine whether the stop is [p] or [b] because acoustically
it will be very unclear which it is. The next stage in sound perception is the PHONOLOGICAL STAGE. It
is at this stage that it will become clear that the sound in question is /p/, not /b/. If you are a speaker of
English, you know how sounds in your language function. You know the phonotactic constraints on the
positions where sounds can appear.
(i) You know that if a word-initial fricative is followed by a stop, that fricative must be /s/. Of the
fricatives, only /s/ is allowed to occur at the beginning of a word if the second sound of the word is a stop. No English
word can begin with /zk/, /fm/, /ʃk/ etc.
(ii) You also know that if the word begins with /s/ and the sound following /s/ is a stop, that stop must be
voiceless. There are words like spin, spoon, stick, skin etc. where /s/ is followed by a voiceless stop.
But there are no words like *sbin, *sboon, *sdick, or *sgin where /s/ is followed by a voiced stop. This
is a phonotactic constraint on the combination of fricatives with stops in English phonology. It is not
something simply determined by their acoustic properties. If the /s/ of spin is electronically spliced off, you
would probably hear the word left behind as [bin], not [pin]. Why is this? It is because the main cue for
distinguishing between [p] and [b] occurring initially in a stressed syllable is aspiration. If you detect
aspiration, you assume it is [pʰ]; if you do not, you assume it is [b]. Because the stop in spin was preceded
by [s] before the splicing, it was not initial and so it was unaspirated. So it is perceived as [b].
Obviously, linguistic knowledge of this kind, this COMPETENCE, lies hidden deep in the mind and
you are unlikely to be conscious of it without taking a course in linguistics.
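The reasoning in (i) and (ii), together with the aspiration cue, can be sketched as a small Python program. This is only an illustrative toy, not a model of how hearers actually process speech: the function names and the symbol sets are assumptions made for the example, the sets are not exhaustive, and only oral stops are covered.

```python
# Toy model of constraints (i) and (ii). The symbol sets are illustrative
# assumptions, not a full inventory of English consonants.
VOICELESS_STOPS = {"p", "t", "k"}
VOICED_STOPS = {"b", "d", "g"}
STOPS = VOICELESS_STOPS | VOICED_STOPS
FRICATIVES = {"f", "v", "θ", "ð", "s", "z", "ʃ", "ʒ", "h"}


def initial_cluster_allowed(first: str, second: str) -> bool:
    """Check a word-initial fricative + stop cluster against (i) and (ii)."""
    if first in FRICATIVES and second in STOPS:
        # (i) the fricative must be /s/; (ii) the stop must be voiceless.
        return first == "s" and second in VOICELESS_STOPS
    # Any other cluster is outside the scope of this sketch.
    return True


def stop_after_initial_s(acoustic_candidates: list[str]) -> list[str]:
    """Phonological stage: drop candidates that the constraints rule out."""
    return [c for c in acoustic_candidates if initial_cluster_allowed("s", c)]


def hear_stressed_initial_labial(aspirated: bool) -> str:
    """Aspiration cue for a labial stop at the start of a stressed syllable."""
    return "pʰ" if aspirated else "b"


# The acoustically unclear stop in 'spin': only /p/ survives.
print(stop_after_initial_s(["p", "b"]))
# 'spin' with the [s] spliced off: no aspiration, so the stop is heard as [b].
print(hear_stressed_initial_labial(False))
```

The point of the sketch is that the choice between /p/ and /b/ after /s/ is settled by knowledge of permitted combinations, whereas the choice in initial position is settled by an acoustic cue.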
We have established that there is no one-to-one match between acoustic phonetic cues and the phonological
interpretation we give them. What is perceived as the ‘same’ sound is not physically the same sound in all
contexts. Very much depends on the context in which the cues are perceived and on what we know to be
permissible in that context in the language.
One model that has been proposed to account for the way people perceive speech is ANALYSIS BY
SYNTHESIS (cf. Halle and Stevens (1962), Studdert-Kennedy (1974, 1976), Stevens and House (1972)).
Its proponents claim that hearers recognise speech sounds uttered by speakers by matching them with
speech sounds that are synthesised in their heads. Specifically, the synthesising is said to involve modelling
the articulatory gestures that the speaker makes to produce those sounds. In the light of what was said above
concerning the incredible speed of word recognition, it is implausible that hearers could run through the
analysis by synthesis routine quickly enough. For this reason, many reject this model.
Clearly, the perception of individual speech sounds is far from straightforward. But the perception of
running speech presents an even greater challenge. A major problem (which people are most acutely aware
of when listening to a language in which they have little competence) is that, in fluent speech, words come
out in a gushing stream. In purely physical terms, it is normally impossible to hear where one word ends and
the next one begins. Looking up each word in the mental lexicon as it is heard is not a credible strategy.
But even if it were possible to separate out words, which clearly it is not, there would be the additional
problem of NOISE, in a very broad sense. In many real life situations, there is not a perfect hush around us
as we speak. There is noise. Lots of noise. In a pub, at a party, at work, in a railway station, in the home, there
are often other people talking, banging, operating noisy machines, playing loud music etc. So we hear some
of the words only partially, if at all. Yet we manage to work out what the other person is saying. How do
we do it? Knowing what is relevant in the context helps. We can make intelligent guesses.
In cases where we communicate using almost fixed formulas, GUESSING is relatively easy. Imagine you
drop in at a friend’s house at 11 a.m. and shortly after welcoming you, your host says:
Would you like *** or ***?
You do not hear the bits marked by asterisks because a loud heavy goods lorry goes past the open window
as she says ***. I expect you would have no problem guessing that the words you did not hear were tea and
coffee. Experience tells you that in this situation they are the most likely words to be used.
Sometimes you are luckier. You manage to hear part of a word properly. In this situation again it is
usually possible to make out the entire word. Suppose the hearer identifies a bit of speech three syllables
long, beginning with [r] and ending in [tə] as in [11.10a]:
a. If you’re cold, get closer to the r***[tə].
b. If you’re cold, get closer to the radiator.
If you’re cold, get closer to the red heater.
The hearer can then guess which word or group of words with the phonological outline that has been
perceived seems to make sense in the context. Either a radiator or a red heater could provide warmth. A
quick inspection to see whether there was a radiator or a red heater in the room would help to settle it.
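The guessing strategy just described can be sketched in Python as a two-step filter: first keep the items that fit the phonological outline, then keep those that make sense in the context. The mini-lexicon, its rough transcriptions and the set of warmth-giving objects are all hypothetical values invented for the illustration, and the sketch matches only the first and last sounds, leaving syllable count aside.

```python
# Hypothetical mini-lexicon: spelling -> rough transcription (assumed values).
LEXICON = {
    "radiator": "reɪdieɪtə",
    "red heater": "red hiːtə",
    "reader": "riːdə",
    "river": "rɪvə",
    "window": "wɪndəʊ",
}

# Hypothetical world knowledge: things that could provide warmth.
SOURCES_OF_WARMTH = {"radiator", "red heater", "fire"}


def match_outline(start: str, end: str) -> list[str]:
    """Keep the items whose transcription fits the perceived outline."""
    return [
        item
        for item, sounds in LEXICON.items()
        if sounds.startswith(start) and sounds.endswith(end)
    ]


def plausible_in_context(candidates: list[str], relevant: set[str]) -> list[str]:
    """Keep the candidates that make sense in the situation."""
    return [item for item in candidates if item in relevant]


candidates = match_outline("r", "tə")
print(plausible_in_context(candidates, SOURCES_OF_WARMTH))
```

With these assumed entries both radiator and red heater survive the two filters, which is why a look round the room is still needed to settle the matter.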
If we are in a particular situation, and know what is RELEVANT in that situation, we can discard some
of the possible words we might think we hear; we can also reject homophonous words which are
inappropriate in the circumstances, and select the more plausible, relevant word:
[vɪkəz ər ɒn straɪk]
If you live in the English shipbuilding town of Barrow, where most of the working population are
employed in the Vickers shipbuilding yard, and you have seen hundreds of women and men staging a protest
outside the gate of the shipyard, you would probably recognise the word as referring to workers at Vickers
and not the vicars with dog collars from all the town’s churches. Knowledge that priests do not strike is also
helpful, of course. So you would use your world knowledge to eliminate vicars. Context and relevance are
vitally important in speech recognition when for some reason the meaning of the words perceived is to some
extent unclear.
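As a rough illustration of how relevance might select between homophones, the Python sketch below scores each reading by how many cues it shares with the situation and picks the highest scorer. The readings, the cue sets and the scoring rule are all assumptions made for the example; nothing here is claimed to be how hearers actually weigh context.

```python
# Hypothetical readings of the one pronunciation, each paired with cues
# that would make it relevant (assumed values for illustration).
READINGS = {
    "Vickers (shipyard workers)": {"shipyard", "protest", "Barrow", "workers"},
    "vicars (clergy)": {"church", "sermon", "parish"},
}


def most_relevant(readings: dict[str, set[str]], context: set[str]) -> str:
    """Pick the reading that shares the most cues with the situation."""
    return max(readings, key=lambda reading: len(readings[reading] & context))


# What the hearer knows about the situation.
context = {"Barrow", "shipyard", "protest", "gate"}
print(most_relevant(READINGS, context))
```

A different context, for instance one containing church and parish, would tip the same procedure towards the vicars.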
Selective listening
Listening is an active, not a passive, activity. The perception of continuous speech is not a simple matter of
deciphering raw noises that impinge on the eardrums. The classic example of SELECTIVE LISTENING is
the so-called ‘cocktail party phenomenon’ where the listener’s role in creating meaning is vital. You can
choose to home in on a specific conversation when a dozen loud conversations are going on in the room. This
shows the importance of selection. The listener constructs meaning from a specific set of auditory signals —
and ignores the rest.
Psychologists have performed experiments to explore the nature of selective listening. They have done
this by designing SHADOWING experiments where subjects repeat a taped utterance aloud, word
for word, as they listen to it in one ear. Meanwhile, the subjects hear, in the other ear, another message
which they are told to ignore. The experiment is set up in such a way that the two messages that are listened
to are equally clear and have the same tempo and volume.
In these experiments people easily manage to focus on the utterance that they are shadowing and to
ignore the utterance that they are not shadowing. They might not even notice that the utterance which they are
not shadowing is in a foreign language, or be able to say whether its speaker was a man or a woman (Cherry 1953).
Exploiting syntactic and semantic clues
Do you have a friend or acquaintance who has that most infuriating habit of finishing your sentences for
you? Both the syntactic and semantic aspects of the utterance may be so predictable that the listener knows
what you are going to say even before the words come out:
A. Do you know what? She has threatened to…
B. …sue the police.
The syntactic and semantic context plays a role in speech comprehension. Listeners use frames provided by
the syntax to retrieve a word with the appropriate meaning from the mental lexicon. This is easily seen in
the handling of ANAPHORIC EXPRESSIONS (i.e. grammatical elements that refer back to something
already mentioned) which are only interpretable in a specific syntactic context. To interpret anaphoric
expressions the hearer must be able to identify the element that has already been mentioned which is being
referred back to. Imagine reading the following sentence in a recipe book:
Add cream to the meat casserole and leave it in the oven for 10 minutes.
How do you work out what it refers to? In this sentence it could conceivably refer to the meat or the
cream. But you do not even consider the cream. This sentence is not ambiguous. Our knowledge of the
world rules out the cream. It would be crazy to leave the cream cooking in the oven. But it is reasonable to
leave meat cooking in a casserole in the oven.
A similar point can be made with regard to STRUCTURAL AMBIGUITY (i.e. situations where a string
of words has different interpretations depending on how we group together the words). Take this example:
I bought some new shirts and jumpers.