Illuminating variation
Individual differences in entrenchment
of multi-word units
Published by
LOT
Kloveniersburgwal 48
1012 CX Amsterdam
The Netherlands
phone: +31 20 525 2461
e-mail: lot@uva.nl
http://www.lotschool.nl
Cover illustration: picture of an artwork by Piet Stockmans, photographed
by Daphne Snijders. To me, it visualizes the dynamic character of mental
representations of language, which may best be viewed as moving
targets.
ISBN: 978-94-6093-333-2
NUR: 616
Copyright © 2019: Véronique Verhagen. All rights reserved.
Illuminating variation
Individual differences in entrenchment of
multi-word units
PhD dissertation
to obtain the degree of doctor
at Tilburg University,
on the authority of the rector magnificus, prof. dr. K. Sijtsma,
to be defended in public before a committee
appointed by the doctorate board,
in the aula of the University
on Friday 10 January 2020
at 13.30 hours
by
Véronique Anne Yvonne Verhagen
born on 12 December 1985 in Eindhoven
Supervisor
prof. dr. A.M. Backus
Co-supervisors
dr. M.B.J. Mos
dr. J. Schilperoord
Doctoral committee
prof. dr. W.B.T. Blom
prof. dr. E. Dąbrowska
prof. dr. H.-J. Schmid
dr. E. Zenner
The research reported in this dissertation was supported by a grant from NWO
(Nederlandse Organisatie voor Wetenschappelijk Onderzoek, the Dutch Research
Council), project number 322-89-004.
Preface
The very first lecture I attended as a student was in the course Linguistics. The
very first lecture I gave as a teacher was for that same course, and it took place
in the room where I had once attended my own first lecture (the textbook and the
assignments were no longer the same, by the way; I would not want to give the
impression that there is no development in this faculty, quite the contrary!). In the
years that followed, I also taught linguistics courses at Leiden University and in
the teacher training programme for Dutch at Fontys. Those activities delayed the
completion of my PhD research 'somewhat', but they also brought me a great deal
of valuable knowledge, experience, and contacts. I am grateful for the
opportunities I was given in that respect. I am at least as grateful for the support
of my supervisors in completing this dissertation.
To begin with Maria: without her energy and reliability this dissertation might still
have come about, but it would certainly have taken longer. Thank you for your
involvement and good advice, and for your pleasant company at conferences.
After a workshop in Potsdam, Jon Sprouse asked whether we were perhaps
sisters. Surprised, you answered No, and added: at most 'academic sisters'. You
are the best academic big sister I could wish for.
I am very grateful to Ad for his unfailing confidence. There are not many
professors who are as wise, generous, and in touch with their feminine side as
you are. The number of people who call on you is unimaginably large, and yet you
always take the time for whatever questions someone has. Whenever I met PhD
students who knew Ad, they were invariably jealous of the fact that he was my
supervisor.
Joost I admire for his wonderful ideas and formulations, and I thank him for his
encouragement to "roar and bluster" and for his ability to look at things from a
different angle. During the defence of my Master's thesis you asked me: And what
if you had turned it around? What if you had asked people to judge how little the
words belong together? – a possibility that had never occurred to me. During my
PhD research, too, you kept coming up with valuable suggestions to turn things
around, and you pointed out the beauty in my data whenever I was focused mainly
on what we could not demonstrate with them.
I am much indebted to Antal van den Bosch for his valuable advice and for putting
me in touch with Jakub Zavrel and Louis Onrust. Jakub is the founder of
Textkernel, a company specializing in artificial intelligence for HR and
recruitment. One of their tools, Jobfeed, scours the internet for job vacancies.
Thanks to this technology and the helpfulness of Jakub and his colleagues, I was
given access to a corpus of job ad texts, for which I am very grateful. Louis's help
in analyzing the dataset of more than 1.36 million job ads was invaluable. I thank
him warmly for his patience and generosity.
Prof. dr. Blom, prof. dr. Dąbrowska, prof. dr. Schmid, and dr. Zenner, thank you
very much for accepting the invitation to be part of the committee. I am greatly
honored that you have read my work and that you are willing to discuss it with
me.
As a PhD candidate and beginning lecturer, I had the privilege of being part of a
department characterized by an extraordinary degree of quality and collegiality.
Adriana, Alex, Alwin, Anne, Annemarie, Carel, Charlotte, Chris, Christine,
Constantijn, David, Diana, Debby, Emmelyn, Emiel, Emiel, Eriko, Fons, Hans,
Jacqueline, Jan, Jan, Janneke, Jorrig, Jos, Joost, Julie, Juliette, Karin, Kiek,
Lauraine, Leonoor, Lieke, Loes, Mandy, Marc, Maria, Marie, Mariek, Marieke, Marije,
Marjolein, Marlies, Martijn, Martin, Menno, Monique, Nadine, Nadine, Naomi, Neil,
Nynke, Paul, Per, Peter, Rein, Renske, Ruben, Ruud, Saar, Sander, Tess, Yan, and
Yevgen, thank you for all the interesting conversations, the pleasant collaboration
in teaching, the sympathy when editor R. drove me to despair, the refreshing walks
in the Oude Warande, the wonderful performances of the Malle band, the fantastic
department outings, the Sinterklaas poems, and the Christmas dinners.
Before I started as a PhD candidate, I was shaped as a student by the work of Ad,
Carine, Erna, Karen, Guus, Helma, Jan, Jan Jaap, Jeanne, Jos, Kutlay, Leon, Max,
Mia, Odile, Piia, Rian, Sander, Sjaak, Ton, and Tineke. Thank you for the fascinating
lectures I attended with great interest, and for letting me 'move in' on the 4th floor.
Alongside my appointment as a researcher in Tilburg, I had the pleasure of
teaching linguistics courses in Leiden for a year and a half, in the programmes
Dutch Language and Culture and Linguistics. Alex, Arie, Esther, Gijsbert, Maaike
Beliën and Maaike van Naerssen, Maarten, Olga, Ronny, Roosmaryn, Saskia,
Tanja, Ton, and Vivien, thank you for this enjoyable and instructive time.
While I was finishing my dissertation, I joined the teacher training programme for
Dutch at Fontys. Arina, Bart, Bas, Chantall, Claudia, Elly, Esther, Gerbert, Hanneke,
Henriëtte, Jan, Julia, Kristien, Maartje, Maartje, Margriet, Monica, Nanette, Petra,
and Rudie, thank you for welcoming me and showing me the ropes in a world that
was new to me. Thank you also for your interest around the time I submitted the
manuscript, and for sharing in my joy when I received word from the committee.
Finally, I want to thank my dear and delightful family and friends for taking part in
experiments, for asking (and for not asking) about my progress, for thinking along
about the layout and the cover, and even more for the many beautiful, funny,
special moments that had nothing to do with this dissertation. A special word of
thanks goes to my parents, whose involvement and care know no bounds.
Contents
Preface
Chapter 1  Introduction  1
Chapter 2  Stability of familiarity judgments: individual variation and the invariant bigger picture  9
Chapter 3  Variation is information: Analyses of variation across items, participants, time, and methods in metalinguistic judgment data  39
Chapter 4  Predictive language processing revealing usage-based variation  69
Chapter 5  Metalinguistic judgments are psycholinguistic data  101
Chapter 6  A concise guide to the design of multi-method studies in linguistics: Combining corpus-based measures with offline and online experimental data  117
Chapter 7  Discussion  135
References  145
Appendices  165
Summary  215
Samenvatting  223
Curriculum vitae  233
Chapter 1 Introduction
Suppose a number of people encounter the utterance Bij gelijke geschiktheid gaat
onze voorkeur uit naar een vrouwelijke kandidaat (‘In case of equal qualifications,
we will give preference to female candidates’). To what extent would they differ in
the linguistic units they employ in processing it, and can we explain these
differences? For a long time, linguists have regarded words and grammatical rules
as the basic units in language. However, it has become increasingly clear that this
is not sufficient as a description of how language is organized in our minds, as
there is considerable evidence that we have a much more varied set of linguistic
units at our disposal. While an utterance such as Bij gelijke geschiktheid gaat onze
voorkeur uit naar een vrouwelijke kandidaat could be produced and understood
by accessing the individual words and the syntactic structure in which they are
embedded, speakers may also employ larger processing units. They can, for
example, make use of multi-word units (e.g. bij gelijke geschiktheid) and partially
schematic units (e.g. gaat ART/POSS voorkeur uit naar NP). As psycholinguistic
research has uncovered, some of these chunks of language are processed more
quickly, recalled more easily, and deemed more familiar than others. This
suggests that they differ from each other in representational strength, or, put
differently, in degree of entrenchment. Usage frequency appears to play a key role
in the process of entrenchment: the more a linguistic unit is used, the more it
becomes entrenched in the speaker’s mental lexicon, thus making it easier for
this speaker to retrieve and process it.
If usage-based models of linguistic representations are correct in positing such
a strong link between usage frequency and entrenchment, it follows that the
extent to which a linguistic unit is entrenched varies from person to person, as
well as over time. There is a shortage of empirical data on these types of variation,
though. As I will discuss in more detail in Section 1.1.1 and in the following
chapters, the past five decades have seen a wealth of studies yielding evidence in
support of usage-based theories of language acquisition and processing, but
these studies have paid little attention to inter- and intra-individual variation. A
central aim of the studies presented in this dissertation is to demonstrate that
insight into these types of variation is a prerequisite for a veridical description of
mental representations of language. The studies thus aim to contribute to usage-based theories of language by examining variation in entrenchment of multi-word
units.
1.1 Usage-based linguistics
Linguistic theories ought to posit a model of linguistic knowledge that explains
how speakers can produce and understand an infinite number of utterances, that
accounts for the ease and speed with which speakers are able to process
language, and that is learnable. Usage-based linguistics is a framework that
accounts for productivity, real-time processing, and learnability by envisioning
linguistic knowledge as dynamic networks of constructions which are shaped by
the cognitive response to social behavior, thus accommodating insights from
both psycholinguistics and sociolinguistics. In this framework, mental
representations of language consist of form-meaning pairings (i.e. constructions)
that are taken to emerge from, and are continuously shaped by, experience with
language together with general cognitive skills and processes such as
categorization, schematization, and chunking (Barlow & Kemmer 2000; Bybee
2006; Goldberg 2006; Tomasello 2003; A. Verhagen 2005). Linguistic
constructions vary in size – ranging from single morphemes (e.g. like) to multi-word units (e.g. to all intents and purposes) – and in schematicity – ranging from
lexically specific constructions (e.g. equal qualifications) to partially schematic
(e.g. V-able) and fully schematic ones (e.g. SUBJECT VERB DIRECT OBJECT). The fact
that, on a usage-based account, language use continuously shapes mental
representations of language means that linguistic constructions are entrenched
to varying degrees.
1.1.1 Degrees of entrenchment
Entrenchment can be defined as "the degree to which the formation and
activation of a cognitive unit is routinized and automated" (Schmid 2007:119; see
also Langacker 1987). Frequency of use is taken to be a key factor determining
degree of entrenchment. The more frequently a speaker encounters and uses a
particular linguistic structure, the more the mental representation of this structure
will become entrenched. As a result, it can be activated and processed more
quickly, which, in turn, increases the probability that this form is used to express
the given message, making this construction even more entrenched. Conversely,
extended periods of disuse weaken the representation (Langacker 1987: 59).
An impressive body of research shows that people are very much attuned to
frequency in language. We are sensitive to distributional properties of sound
sequences, morphemes, words, word sequences, and syntactic patterns, and we
make use of this information in language acquisition and processing (for
overviews see Diessel 2007; N. Ellis 2002; Gries & Divjak 2012; Saffran 2003).
With regard to multi-word units –the type of construction that I focus on in my
studies– numerous studies have demonstrated a strong relationship between the
frequency with which a word sequence occurs in the language and the extent to
which its formation and activation in the minds of speakers is routinized, as
evidenced by pronunciation duration and phonological reduction (e.g. Arnon &
Cohen Priva 2013; Bannard & Matthews 2008; Bybee & Scheibman 1999; Janssen
& Barber 2012), perceptual identification (e.g. Caldwell-Harris, Berant & Edelman
2012), reading times (e.g. N. Ellis & Simpson-Vlach 2009; Fernandez Monsalve et
al. 2012; McDonald & Shillcock 2003; Siyanova-Chanturia, Conklin & van Heuven
2011; Smith & Levy 2013), phrasal decision times (e.g. Arnon & Snider 2010;
Jolsvai, McCauley & Christiansen 2013), and N400 effects (e.g. Frank et al. 2015).
These findings suggest that linguistic constructions vary in the extent to which
they are entrenched in speakers’ mental constructicons and that degree of
entrenchment is strongly correlated with usage frequency.
As Tomasello (2007: 282, as cited in Divjak 2016) aptly remarks, “[t]oday, very
few linguists would seriously deny the existence of frequency effects in language.
The real argument within linguistics is how far these effects go”. I propose that
an investigation of inter- and intra-individual variation in psycholinguistic data can
advance our understanding of the effects of usage frequency on language
processing and mental representations of language. These kinds of variation
naturally follow from a usage-based perspective. In order to do justice to the
usage-based approach, researchers ought to attend to such variation and examine
to what extent it is usage-based and what it reveals about the dynamic nature of
mental representations.
1.1.2 Variation in degrees of entrenchment
If representational strength is determined largely by usage frequency, there are
likely to be differences in entrenchment across individuals, even within a group
that is relatively homogeneous in terms of sociolinguistic characteristics, since
language users differ in their linguistic experiences. It is not known, though, how
large these differences are. Given that speakers are able to communicate rather
successfully, it appears that linguistic representations do not diverge widely. Still,
differences may be more profound than is often assumed. While sharing
knowledge of high-frequency schematic structures (e.g. the transitive
construction SUBJECT VERB DIRECT OBJECT) and a large inventory of specific
linguistic elements such as single words and multi-word chunks, speakers differ
in the extent to which they encounter and use particular words, word
combinations, and (partially) schematic constructions. The frequency with which
they experience such constructions differs, the contexts in which they encounter
them differ, and the ways in which they combine various constructions differ as
well. Such differences are expected to result in variation across speakers in
linguistic representations.
In addition to inter-individual variation, a usage-based approach predicts intra-individual variation. Effects of usage on linguistic knowledge are not restricted to
children acquiring their mother tongue(s) and adults acquiring a foreign language;
they also hold for adult native speakers. All language users gain new linguistic
experiences throughout their lives, and usage-based linguistics predicts mental
representations of language to change accordingly.
To date, few studies have examined the variability of mental representations
of language in adult native speakers. Cognitive linguists often make use of corpus
data; these corpora are usually an amalgamation of texts and/or recordings of
spoken language from many different language users, which are unlikely to be
fully representative of the linguistic experiences of the people taking part in a
study, and unlikely to be equally representative for all participants. Some
researchers have analyzed corpora composed of data of an individual speaker
(e.g. Barlow 2013; Dąbrowska 2014; Schmid & Mantlik 2015). Their findings point
to individual differences in the use of various constructions. However, patterns of
use as observed in corpus data cannot be equated with the degrees to which
constructions are entrenched in the mind of the speaker. In order to link these
patterns of use in corpus data to entrenchment, they need to be supplemented
with data from psycholinguistic experiments.
While it is starting to become common practice to analyze experimental data
by means of statistical models that account for individual differences (e.g. mixed-effects models), the variation present in psycholinguistic data is rarely analyzed
in its own right. Experimental data are usually reported as aggregated scores,
without regard for the degrees of variation and the information they may convey.
Furthermore, whenever a study involves multiple types of experimental tasks,
these are commonly conducted with different groups of participants.
Consequently, variation across tasks and variation across speakers are
confounded. As a result, such studies yield little insight into inter-individual
variation. In addition, participants are seldom asked to perform a task multiple
times. Therefore, not much is known about the degrees of intra-individual variation
from one moment to another.
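The point about aggregation can be made concrete with a toy example. The sketch below uses invented ratings from two hypothetical participants with different experience profiles: the item means averaged over participants look unremarkable, while the per-item spread across participants reveals exactly where individuals diverge.

```python
# Hypothetical illustration: aggregated scores can mask systematic
# inter-individual variation. All ratings are invented (scale 1-7).
from statistics import mean, stdev

ratings = {
    "recruiter": {"goede contactuele eigenschappen": 7, "de Tweede Kamer": 4,
                  "bij gelijke geschiktheid": 6, "op de bank": 5},
    "student":   {"goede contactuele eigenschappen": 2, "de Tweede Kamer": 5,
                  "bij gelijke geschiktheid": 1, "op de bank": 6},
}

items = list(ratings["recruiter"])

# Aggregated item means (the usual way data are reported) ...
agg = {item: mean(p[item] for p in ratings.values()) for item in items}

# ... versus per-item spread across participants, which shows that the
# divergence is concentrated in the job-ad phrases.
spread = {item: stdev(p[item] for p in ratings.values()) for item in items}

for item in items:
    print(f"{item:38s} mean={agg[item]:.1f} sd={spread[item]:.1f}")
```

Only the second view reveals that the two raters disagree sharply on the register-specific phrases while agreeing on the everyday ones, which is the kind of information aggregation throws away.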
1.2 Multi-word units
In this dissertation, I focus on multi-word units as linguistic constructions. In the
last couple of decades, the importance of multi-word units in language acquisition
and processing has come to the fore. Analyses of the utterances produced by 2- and 3-year-olds and the input they had received reveal that children stick close to
word strings they have encountered in the input (Dąbrowska & Lieven 2005). In
addition, experimental research has shown that the more frequently phrases
occur in child-directed speech, the better children are at processing and
(re)producing them (Arnon & Clark 2011; Bannard & Matthews 2008; McCauley &
Christiansen 2014). These lexically specific constructions form the basis for
schematic constructions; by generalizing over specific instances, children are able
to arrive at more abstract schemas (Goldberg 2006). The emergence of
schematic constructions does not imply that multi-word units become less
important. In fact, usage-based theories consider more specific constructions as
more basic:
lower-level schemas, expressing regularities of only limited scope, may on
balance be more essential to language structure than high-level schemas
representing the broadest generalizations. (…) For many constructions, the
essential distributional information is supplied by lower-level schemas and
specific instantiations (Langacker 2000: 30-31).
Syntactic and semantic analyses of instances of various constructions provide
support for this point of view (e.g. A. Verhagen 2003). This is complemented by
empirical evidence that indicates that adult speakers store phrases and that the
use of these ready-made chunks facilitates sentence comprehension and
production (e.g. Arnon & Snider 2010; Arnon & Cohen Priva 2013; Bybee &
Scheibman 1999; Caldwell-Harris, Berant & Edelman 2012; Dąbrowska 2014; N.
Ellis & Simpson-Vlach 2009; Janssen & Barber 2012; Jolsvai, McCauley &
Christiansen 2013; Shaoul, Baayen & Westbury 2014; Siyanova-Chanturia, Conklin
& van Heuven 2011; Tremblay & Baayen 2010). This has led cognitive linguists to
the viewpoint that the use of ready-made chunks is the basic mode of using
language (e.g. Bybee 2007: 279-280; Dąbrowska 2014: 642; Wray 2002, also see
Christiansen & Chater 2008, 2016 and McCauley, Isbilen & Christiansen 2017).
1.3 This dissertation
The studies presented in this dissertation examine variation between and within
participants in their metalinguistic judgments about, and processing of, multi-word sequences. They investigate the variation present in the data and the extent
to which this variation can be considered meaningful. From a theoretical
perspective, insights into the degree of individual variation contribute to a
refinement of usage-based accounts. Findings indicate to what extent variation
should be part of linguistic descriptions. They also enable us to delineate more
precisely the limitations of different research methods that aim to tap into degrees
of entrenchment.
This dissertation also serves as a proof of concept. The studies employ
research designs and methods that are well suited to test hypotheses that follow
from usage-based theories of linguistic knowledge and language processing, and
to yield insight into inter- and intra-individual variation. The approach adopted here
can be extended, in future research, to other groups of speakers, other linguistic
registers, and other types of linguistic constructions. In this dissertation, multi-word units are the construction of interest, since they have been shown to play a
pivotal role in language processing. Another reason to focus on multi-word
sequences is that this type of construction lends itself well to the investigation of
usage-based variation. Registers and social groups are likely to differ more
notably in the usage of multi-word units than in experience with schematic
constructions. Since schematic constructions have a more general and abstract
meaning than lexically specific ones, they may be less sensitive to usage contexts
that differ from one person to another. In Chapter 7, I discuss to what extent the
findings presented in this
dissertation can be expected to hold for constructions other than multi-word units.
1.3.1 Outline
Chapters 2 through 5 report on experimental research combining corpus analyses
and psycholinguistic data. In Chapter 6, I reflect on the methodological lessons
that can be learned from these studies; in Chapter 7, I discuss the theoretical
implications. Chapters 2, 3, 4, and 6 are based on articles published or submitted
for publication in peer-reviewed journals.
Chapters 2 and 3 present two studies that examine inter- and intra-individual
variation in metalinguistic judgments. The latter is investigated by means of a
test-retest design: participants performed the same task twice within the space
of one to three weeks. In both studies, participants were asked to assign
familiarity ratings, using the method of Magnitude Estimation, to a set of
prepositional phrases that cover a wide range of corpus frequencies. In Chapter
2, these phrases were presented in isolation as well as in a sentential context, to
investigate whether context affects perceived degree of familiarity and inter- and
intra-individual variation in judgments. The judgment task in Chapter 3 involved
isolated phrases only. In this study, participants used either a 7-point Likert scale
or a Magnitude Estimation scale. The research design employed in Chapter 3 thus
yielded data on variation across items, across participants, across time, and
across rating methods.
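Because Magnitude Estimation lets each participant choose their own scale, scores must be made comparable before participants can be compared. A common normalization, shown here as a hedged sketch rather than the exact pipeline used in these chapters, is to log-transform and z-score the ratings within each participant.

```python
# Sketch of a standard normalization for Magnitude Estimation data
# (an assumption for illustration, not necessarily this study's pipeline).
import math
from statistics import mean, stdev

def normalize(scores):
    """Log-transform and z-score one participant's ME ratings."""
    logs = [math.log(s) for s in scores]
    m, sd = mean(logs), stdev(logs)
    return [(x - m) / sd for x in logs]

# Two participants rating the same three phrases on self-chosen scales:
p1 = [10, 50, 100]   # one participant anchors around 50
p2 = [1, 5, 10]      # another anchors around 5
print(normalize(p1))
print(normalize(p2))
```

Although the raw numbers differ by an order of magnitude, the two participants express the same relative judgments, and the normalized scores come out identical; only genuinely different judgments survive the transformation.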
Chapters 4 and 5 report on three experiments that were conducted with three
groups of participants: recruiters, job-seekers, and people not (yet) looking for a
job. These groups can be expected to differ in experience with word sequences
that typically occur in job ads (e.g. goede contactuele eigenschappen ‘good
communication skills’); they are not expected to differ systematically in
experience with word sequences characteristic of news reports (e.g. de Tweede
Kamer ‘the House of Representatives’). The participants first performed a
completion task, which offers insight into the expectations people generate about
upcoming words. This was followed by a voice onset time (VOT) task, which
provides data on the speed with which the participants process the word strings.
After that, the participants assigned familiarity ratings to the word sequences
using Magnitude Estimation. Chapter 4 reports on the completion task and the
VOT task; Chapter 5 reports on the metalinguistic judgment task.
In Chapters 4 and 5, I examine the relationship between amount of experience
with a particular register and (i) the expectations people generate about
upcoming words when faced with word strings characteristic of that register; (ii)
the speed with which they process such word strings; and (iii) how familiar they
consider these word strings to be. Furthermore, I investigate the relationships
between data elicited from an individual participant in different types of
psycholinguistic tasks using the same stimuli. Comparisons of participant-based
measures and measures based on amalgamated data of different people as
predictors of performance in psycholinguistic tasks provide insight into individual
variation and the merits of going beyond amalgamated data.
Chapter 6 highlights the merits of multi-method research in linguistics and
offers an overview of key considerations in the design of such research. Chapter
7, finally, provides a summary of the main findings and discusses the theoretical
implications as well as suggestions for future research.
Chapter 2
Abstract
Judgments are often used in linguistic research. Not much is known, however,
about the variation of such judgments within and between participants. From a
usage-based perspective, variation might be expected: with judgments based in
representations, and representations resulting from input and use, both inter- and
intra-individual variation are likely. This study investigates the reliability of
metalinguistic judgments, more specifically familiarity judgments, for Dutch
prepositional phrases (e.g. op de bank, ‘on the couch’). Familiarity judgments for
44 PPs offered in isolation and in a sentential context were given by 86
participants in two identical test sessions, using Magnitude Estimation.
Aggregated scores (averaged over participants) are remarkably consistent
(Pearson’s r = .97), and in part predicted by corpus frequencies. At the same time,
there is considerable variation between and within participants. Context does not
reduce this variation. We interpret both the stability and instability to be real
reflections of language: a relatively stable system in a speech community
consisting of speakers who are variable and forever changing. The results suggest
that judgment data are informative at different levels of granularity. They call for
more attention to individual variation and its underlying dynamics.
This chapter is based on:
Verhagen, V. & Mos, M. (2016). Stability of familiarity judgments: Individual
variation and the invariant bigger picture. Cognitive Linguistics, 27(3), 307–344.
https://doi.org/10.1515/cog-2015-0063
Acknowledgements
I thank Dominique Mellema for his help in collecting the data, and Dagmar Divjak
and two anonymous reviewers for their helpful comments on the paper we
submitted to Cognitive Linguistics.
Chapter 2 Stability of familiarity judgments:
individual variation and the invariant bigger picture
2.1 Introduction
Metalinguistic judgments constitute an oft-used type of data in a variety of fields
within linguistics, ranging from grammaticality and acceptability judgments (e.g.
Sprouse & Almeida 2012 for syntactic patterns; N. Ellis & Simpson-Vlach 2009 for
formulaic language; Granger 1998 and Gries & Wulff 2009 for collocations and
constructions in L2 speakers) to judgments regarding productivity (e.g. Backus &
Mos 2011) and idiomaticity (e.g. Wulff 2009). Various researchers have criticized
the validity and reliability of metalinguistic judgments (e.g. Bornkessel-Schlesewsky & Schlesewsky 2007; Sampson 2007). Still, the general assumption
behind the use of judgment data in linguistic research is that they provide us with
information about linguistic representations, overlaid with certain amounts of
processing difficulty, depending on the specifics of the task and the setting, that
cannot be deduced from natural language use or psycholinguistic, experimental
data. All the more remarkable is the fact that we do not know how stable and
therefore reliable such judgments are. As early as 1987, Labov stated: “The most
obvious hiatus in the foundations of modern linguistics is the absence of a
concern for the reliability and validity of the introspective judgments that form the
main data base of grammatical research”.
Since Labov’s observation, several decades have passed and still the reliability
of metalinguistic judgments has not been investigated thoroughly. To be sure,
there is a large body of literature on ratings (for an overview see Schütze &
Sprouse 2013) and various studies have compared judgment data to other types
of data such as expert intuitions (Dąbrowska 2010), textbook classifications
(Sprouse & Almeida 2012), and corpus data (Balota et al. 2001). However, such
comparisons do not provide conclusive evidence about the stability of and
variation in judgments. Typically, judgments by different participants are averaged
and inter-individual differences are regarded as ‘noise’ (but not always, viz.
Dąbrowska 2012, Dąbrowska 2013; Barlow 2013; Barth & Kapatsinski 2014).
Given that people differ in their linguistic experiences and in the language they
produce themselves, individual differences actually are to be expected in
judgment data (depending on the items that are judged, a point to which we will
return below). A discrepancy between one person’s judgments and those of other
people, or between someone’s judgments and corpus data, does not necessarily
invalidate these judgments. People may differ from each other in real and
meaningful ways, each expressing their own linguistic representations. The most
thorough and direct way of examining the stability of judgments, while allowing
for differences between individuals as well as between items, is to have people
judge the same linguistic stimuli several times, which is not common practice.
In this paper, we address the issue of variability in linguistic judgments. The
paper starts by introducing the particular type of stimulus items and judgment
used in the current study: familiarity ratings for multi-word units. We argue where
and why differences between people (hereafter inter-individual variation) as well
as within a single language user (intra-individual variation) might be expected.
This is followed by a discussion of an important factor that could influence these
two types of variation: providing a context to stimulus items. We then report on
the outcomes of an experimental study into the stability of metalinguistic
judgments and the relationship between these judgments and corpus data. We
discuss how the observed stability and instability in judgments can be accounted
for in a usage-based framework, and how these findings call for further investigation
of the variability of (meta)linguistic representations. As such, this study contributes to
our understanding of the relation between individuals’ judgments on the one hand
and their linguistic representations as well as the entrenchment of patterns in the
speech community on the other.
2.1.1 Judging multi-word units
In this study we focus on multi-word units, and the judgment data concern the
perceived familiarity of these units. A multi-word unit is a string of words that are
taken to be stored together, as a whole, in one’s linguistic repertoire (a.o. Wray
2002). Multi-word units have characteristics that make them suitable to be
assessed in a familiarity judgment task. They are small enough to be stored as
chunks. Moreover, they are plausible units as they form a semantic and syntactic
unity. This also means that it is easier for people to provide familiarity ratings for
multi-word strings than for entire sentences, skip-grams (i.e. discontinuous multi-word n-grams, such as go to … lengths) or bound morphemes.
The basis for the entrenchment of multi-word sequences is the fact that words
tend to occur in certain constructions and collocate to form multi-word units
(Stefanowitsch & Gries 2003 and many others). Numerous studies provide
evidence that language users are sensitive to the likelihood of words to co-occur
(e.g. Jurafsky et al. 2001; Mos et al. 2012). If one takes a usage-based perspective
on language processing and representation, as we do here, distributional patterns
are inextricably related to one’s cognitive linguistic representations, as knowledge
of a language is in large part built from (mostly implicit) memories of past
linguistic experiences (see, for example, J. Taylor 2012). To put it more precisely,
our linguistic representations emerge from our experience with language —that is,
the language we encounter and produce ourselves— together with general
cognitive skills and processes such as schematization, categorization and
chunking. The latter, of particular importance here, is the process “by which
sequences of units that are used together, cohere to form more complex units”
(Bybee 2010: 7). ‘Complex’ here means that the unit consists of multiple elements
that are packaged together in cognition. The process of chunking is thought to
occur in adults as readily as in children, and to apply to all kinds of sequences of
linguistic elements.
The principal experience that triggers chunking of multi-word sequences is
frequency-based: repetition (Bybee 2010). The more a sequence of words is used
together, the more entrenched it becomes as one chunk. An impressive body of
research has revealed a log-linear relationship between usage frequency –usually
estimated on the basis of corpus data— and processing as measured in
psycholinguistic experiments (see for instance N. Ellis 2002; Diessel 2007).
Furthermore, log-transformed frequency scores have been shown to resemble the
way language users perceive differences in frequency (e.g. Popiel & McRae 1988
for idioms; Balota et al. 2001 for single words).
These studies, however interesting, do not tell us much about variation in
individuals’ cognitive representations of multi-word sequences —that is, the
synchronic result of accumulated exposure and chunking— nor about people’s
ability to reliably report on these representations. In order to investigate the
perceived degree of ‘chunkiness’ of a word sequence we designed a set of
prepositional phrases and asked people to judge these phrases twice within the
space of a few weeks (a more detailed description is given in Section 2.2 below).
Participants were asked to provide familiarity judgments. Familiarity of a word
sequence (or any other type of linguistic element) is taken to rest on frequency
and similarity to other words, constructions or phrases (e.g. Bybee 2010: 214). As
such, familiarity taps into exposure and chunking, while it does not require
introducing a new concept to participants. Asking participants to provide ratings
for ‘familiarity’ rather than ‘entrenchment’, ‘chunkiness’, or ‘unit status’ means
that it is not necessary to introduce jargon. Furthermore, it does not evoke a
right/wrong distinction, and the concept of familiarity involves both one’s own
usage and one’s experience with other people’s use of the items.
A substantial number of studies have made use of familiarity ratings for words,
word pairs, phrases, idioms, and metaphors. These ratings were found to be
significant predictors of reading times (e.g. Cronk et al. 1993; Juhasz & Rayner
2003; Williams & Morris 2004), as well as performance on lexical decision and
speeded naming tasks (e.g. Gernsbacher 1984; Connine et al. 1990; Blasko &
Connine 1993; Juhasz et al. 2015), speeded semantic judgment tasks (among
others, Tabossi et al. 2009), and perceptual identification tasks (Caldwell-Harris
et al. 2012). Gernsbacher (1984: 227) states that asking participants to rate how
familiar they are with a word is a simple tool for collecting a measure of the extent
and type of previous experience respondents have had with each word. Juhasz et
al. (2015: 1005), in like manner, write: “Rated familiarity can be thought of as a
measure of subjective frequency such that it indexes the experience that an
individual has with a given word.” As familiarity crucially depends on prior
linguistic experiences, it implies variation, both across speakers and over time.
These two types of variation are discussed in more detail successively.
2.1.2 Inter- and intra-individual variation
People differ, from one person to the next, in the way in which, and the frequency
with which, they encounter and use particular word strings. As J. Taylor (2012:
250) puts it: “It is evident even to the most casual observer that speakers of the
‘same’ language may exhibit variation in their usage patterns according to their
geographical provenance, their social status, their educational background, their
age, gender, ethnicity, and so on”. If linguistic representations are assumed to be
based on one’s linguistic experiences, such differences are expected to give rise
to variation in these linguistic representations.
Within the Cognitive Linguistics framework, the idea that people may differ
considerably in their linguistic knowledge, not just at the level of lexical repertoires,
has been put forward convincingly by Dąbrowska (2012, 2013), among others.
She discusses a number of recent studies showing that adult monolingual native
speakers of the same language do not share the same mental grammar.
Dąbrowska argues that these differences may be caused by various factors. At
times, it appears that speakers attend to different cues in the input. It may also
well be the case that for certain constructions, some speakers extract only
specific, ‘local’ generalizations, while others acquire more abstract rules. More
educated speakers appear to acquire more general rules, possibly as a result of
more varied linguistic experience.
There is reason to suspect that inter-individual variation may be particularly
large when it comes to multi-word units. Language users are likely to share a large
inventory of small, specific linguistic elements, such as single words and small
chunks, e.g. HET BOEK, the choice of a neuter definite article in combination with
the noun boek, as this combination is very frequent and alternatives, e.g. DE BOEK,
the non-neuter definite article + boek, are (nearly) absent in the ambient language.
Linguistic representations of larger, very general structures will be very similar too.
An example of such a construction is the transitive pattern SUBJECT VERB OBJECT in
which an Agent does something to a Patient. While the transitive sentences two
speakers encounter will differ in content, the commonalities in meaning and
structure enable the two speakers to arrive at similar abstract representations.
People, most likely, differ to a larger extent in how, and how often, they encounter
and use particular combinations of words and chunks. For example, the words
vast (fixed, firm, certainly) and zeker (safe, certain, probably) are used frequently
by both speakers of Belgian Dutch and speakers of Netherlandic Dutch. These
two groups differ, however, in how they combine the two words in a multi-word
unit that means ‘definitely’. Both the orders vast en zeker and zeker en vast are
observed. But Flemish speakers tend to prefer zeker en vast (at a ratio of
approximately 4:1), whereas in the Netherlands vast en zeker is more frequent (at
7:1).1 So, while Belgians and Dutch differ relatively little in usage frequency of the
single words, they differ markedly in how and how often they use the two multi-word units and, presumably, in how familiar they consider each of them to be.
Investigations of the differences in language use between Belgians and Dutch
are one example of the ways in which inter-individual variation is commonly
studied: variation between speakers is examined by comparing groups that differ
in terms of location (dialect), SES (sociolects) or ethnicity (ethnolects). However,
also within such groups of speakers, there are likely to be differences between
people in linguistic representations, as two persons are never identical in their
language use and language exposure. In most linguistic judgment studies,
variation between participants is either ignored, or reported as standard deviations
but not discussed as a result in itself, or only taken into account by comparing
groups of speakers. A usage-based perspective calls for an investigation that
looks beyond such group averages. It also entails that differences between people
in metalinguistic judgment are not sufficient to warrant the conclusion that these
judgments are unreliable. Such differences may reflect genuine and meaningful
differences in linguistic representations. In this study, the focus is on the variation,
in order to shed a more complete light on the interplay of individual linguistic
representations and the language system of a speech community.
In addition to inter-individual variation, a usage-based approach predicts intra-individual variation. If knowledge of a language in large part arises from usage, it
is inherently dynamic. One’s linguistic experiences change over time; one’s
linguistic representations are taken to change accordingly. Metalinguistic
judgments based on changeable representations, therefore, are not expected to
be stable over time. But what if the time frame is limited to a fairly short period in
which the use of the word strings in question has not changed much? How
(un)stable are people’s judgments when they are to grade the same set of stimuli
twice within a time span short enough for usage not to have changed much, yet
long enough not to be able to recall the exact scores assigned the first time?

1 Ratios taken from the SoNaR corpus, a balanced, 500-million-word reference
corpus of contemporary written Dutch texts of various styles, genres and sources,
originating from the Dutch-speaking area of Belgium (Flanders) and the
Netherlands, as well as Dutch translations published in and targeted at this area
(Oostdijk et al. 2013).
Even when usage frequency has not changed much for a particular stimulus,
judgments regarding its familiarity may vary from one moment to the other due
to differences in associations and the frame of reference used.2 In judging
familiarity, a speaker will activate potential uses of a given stimulus. The ease with
which this is done, and the kinds of frames activated are highly dependent on the
linguistic and extra-linguistic context. In the following section possible effects of
context are discussed in more detail.
2.1.3 Context
Both the (extra-)linguistic context in which a participant encounters a stimulus
and the (extra-)linguistic contexts the word string evokes, contribute to a frame
of reference in which the stimulus is assessed. The extra-linguistic context —
roughly speaking the setting in which the language use takes place— evokes
scenarios a language user employs to interpret the linguistic input (Lakoff 1987),
e.g. as a customer in a restaurant setting, it is perfectly fine to be told “let me tell
you what today’s specials are”, followed by an enumeration of dishes. While
clearly relevant for language use, this is not the type of context we focus on here.
By having the participants in the current study perform the task in the exact same
setting (location, experiment leader, instructions, format), we controlled for
variation in the extra-linguistic context.
What we explore is how providing a sentential context for the stimuli may
influence variation in metalinguistic judgments. Survey studies and studies of real-time language comprehension have shown that the immediate linguistic context
affects the way in which word strings are interpreted, processed, and responded
to (e.g. Camblin et al. 2007; Kamoen 2012). When it comes to empirical studies
involving metalinguistic judgments, such context is usually deliberately absent. In
lexical decision tasks, for example, the stimulus is the isolated word (or words)
that participants must recognize, not a (non-)word in a sentence. For
grammaticality judgments, the unit that is assessed is the isolated sentence
(numerous examples in Sprouse et al. 2013). Any influence of linguistic elements
other than the phenomenon under investigation would usually be regarded as
noise.

2 One other obvious potential cause of intra-individual variation in familiarity
ratings would be recent exposure, i.e. priming effects (e.g. Luka & Barsalou 2005;
Schwanenflugel & Gaviska 2005). This is not the focus of the current study.
Effects of recency, salience and other related concepts in exposure prior to a
judgment task would have to be manipulated and/or measured systematically for
participants. This would involve a tightly controlled experimental setting, with all
linguistic exposure recorded.
For judgments regarding the familiarity of units such as the prepositional
phrases (PPs) we are investigating here, providing a context encapsulates the
stimulus in a setting that makes it arguably more meaningful and realistic. In
natural language use, these phrases do not occur in and of themselves; they occur
in utterances. When a phrase is presented as an isolated word string, it may evoke
different meanings and usage contexts across participants, and also within one
person from one moment to another. Adding a context could reduce variation, as
participants are prompted to focus on the same instance. For instance, when
reading the words ‘on the door’, one may think of a poster hanging on the door,
the practical joke with the bucket on the door, or someone knocking on the door.
The number and kinds of usage contexts and the ease with which they come to
mind will influence familiarity judgments. Diversity in associations may be related
to differences in linguistic experiences, but it could also be more coincidental,
resulting in less consensus among participants and more instability over time.
It is, as yet, an open question to what extent variation in familiarity judgments
changes when the target items are embedded in a sentence. A sentential context
activates a specific sense and generates an exemplar, which may guide the
process of judging the item. For phrases that are used frequently, participants can
easily come up with exemplars themselves. Presenting such frequent items in a
sentence will probably not affect ratings much, provided that the sentence
corresponds to participants’ associations. Should the context not resemble the
exemplars participants were thinking of, the scores may be lowered. For low-frequency stimuli, participants are more likely to have difficulties coming up with
an exemplar. Giving a sentence context could then heighten the sense of
familiarity, if it activates memory traces of very similar usage. If the given
sentence context is not one that the participant recognizes, the effect could be
that the item itself is rated as less familiar. Given that only one sense is mentioned,
other possible uses of the item may not be taken into consideration. The PPs
presented in this study were all fairly common phrases, many if not all of them
polysemous or even homonymous (as in (1)).
1. Op de bank
on the couch/bank
De jongens liggen op de bank televisie te kijken.
The boys lie on the couch television to watch
The boys are lying on the couch watching TV.
The context provided by the sentence in (1) is one that occurs frequently with this
PP in the Corpus of Spoken Dutch, i.e. with an animate agent positioned [on the
couch] involved in an activity. However, the word bank is a homonym; it can refer
to a piece of furniture, as well as to a financial institution. The context generates
a clear exemplar of the word in one sense, but at the same time rules out the other
sense.
In conclusion, context may push the sense of familiarity up or down,
depending on whether the provided context ties in with associations triggered in
a participant’s mind. Regardless of the direction, the expectation is that contexts
reduce intra-individual variation in judgments as they steer what sense is evoked.
Context may also reduce inter-individual variation, as it stimulates all participants
to focus on the same kind of exemplar, but this crucially depends on the extent
to which a specific context is familiar to different participants. For high-frequency
stimuli, effects of context are expected to be smaller. These stimuli are more likely
to evoke the same kinds of exemplars across participants and at different points
in time, and the contexts provided are likely to be recognizable to many of them.
2.1.4 Research questions
To start with, we examine the extent to which familiarity judgments are related to
usage frequency and influenced by context. In our main analyses we investigate
how stable these familiarity judgments are, looking at both inter- and intra-individual variation, and to what extent the stability varies depending on the
frequency of the word combination and the presence of a context.
Given that familiarity ratings are taken to rest on usage frequency and
similarity to other constructions, we expect to find a correlation between ratings
and corpus frequencies. Furthermore, inter-individual variation in ratings is to be
expected, since people differ in their linguistic experiences. Intra-individual
variation is hypothesized to be smaller, as the rating sessions take place in a fairly
short period in which the use of the word strings in question will not have changed
much. We expect that embedding the stimuli in a context will reduce intra-individual variation in judgments, as the context steers what sense is evoked.
Whether or not context reduces inter-individual variation depends on the extent to
which a specific context is familiar to different participants. Finally, the more
frequent the item, the smaller effects of context are expected to be.
In order to test these hypotheses, we had participants judge the same linguistic
stimuli twice within a relatively short period of time, in the same experimental
setting. The data yield insight into the ways in which individual linguistic
representations and the language system of a speech community are interrelated.
2.2
Method
2.2.1 Design
In order to test the stability of linguistic familiarity judgments for items with a
range in frequency, and the influence of presenting these items in isolation or in
context, a 2 (TIME) x 2 (CONTEXT) fully within-participant design was used. All
participants rated 44 items both in isolation and in context, twice within the space
of two to three weeks.
2.2.2 Participants
The participants were 86 students of Communication and Information Sciences
at Tilburg University (66 female, 20 male) with an average age of 21.6 years. All
of them were native speakers of Dutch. They participated for course credit.
2.2.3 Material
2.2.3.1 Stimulus items
Participants were asked to rate 44 Prepositional Phrases (PPs) consisting of a
preposition and a singular noun, and in the majority of cases a determiner (i.e.
35 with a definite article, 1 with the possessive zijn ‘his’). An initial set of items was taken
from V. Verhagen and Backus (2011) from which a selection was made based on
two frequency characteristics: they represented a wide range in frequency (from
9 to 1066) in the approximately ten-million-word Corpus of Spoken Dutch (Corpus
Gesproken Nederlands, henceforth CGN) and for all items this particular P–
(Det)–N combination was the most frequent one compared to configurations
with other determiners and inflectional forms of the noun (for a full list of items,
and frequency data in CGN, see Appendices 2.1 and 2.2).3
For each PP a context sentence was created with a full lexical verb and often
a nominal subject and object based on its occurrences in CGN (e.g. in de kast ‘in
the cupboard’ often co-occurs with leggen ‘lay’, describing events in which
someone puts something in a cupboard). The sentences were between 6 and 12
words long, with the PP occurring in the second half of the sentence but never as
the final constituent, as in (2). We made sure not to refer to entities that may
evoke strong feelings (e.g. ‘Saddam Hussein’). All sentences are listed in
Appendix 2.1.
2. Ze heeft de spulletjes in de kast gelegd.
She has the little-stuff in the cupboard put.
She put the things in the cupboard.
3 CGN is a fairly small corpus. When SoNaR (a balanced reference corpus of
contemporary written standard Dutch [Oostdijk et al. 2013]) became available, we
investigated how often the items occur in the Netherlandic Dutch subset
consisting of 143.83 million words. For both the PP as a whole and the noun
(lemma search) there is a strong correlation between the CGN and the SoNaR
frequencies (r = .93 and r = .90, respectively).
2.2.3.2 Judgment task
Participants were asked to rate familiarity using Magnitude Estimation (Bard et
al. 1996). In this type of task, no set judgment scale is provided to the participants.
Instead, participants rate each stimulus relative to the preceding one. This
procedure requires a brief introduction and practice session (see Section 2.2.4).
The construct of familiarity is clearly a gradual one, which fits well with the ratings
provided by participants in a Magnitude Estimation task. Such a task allows
participants to build their own scale. In contrast to a Likert scale, a Magnitude
Estimation scale does not impose a limited set of degrees of familiarity. The scale is
open-ended, meaning that it is always possible to add higher or lower scores.
Furthermore, participants are free to make as many fine-grained distinctions as
they feel appropriate. Magnitude Estimation has been used successfully in
judgments of grammatical well-formedness (e.g. Bader & Häussler 2010),
productivity of morphological and modal verb constructions (Backus & Mos 2011)
as well as idiomaticity (Wulff 2009). Among these, Wulff explicitly mentions that
inter-subject consistency was extremely high, and Backus and Mos report high
reliability measures (Cronbach’s α = .85). In a follow-up study (reported on in
Chapter 3), highly similar to the one reported here, we asked a new group of
participants to give familiarity ratings at two points in time using either a
Magnitude Estimation or a 7-point Likert scale. The type of scale does not appear
to influence the degree of inter- and intra-individual variation much.
2.2.4 Procedure
The experiment was carried out in one computer room in the participants’ faculty
building under a research assistant’s supervision. All participants completed the
experiment twice, with a period of two to three weeks between the first and
second session. They knew in advance that the experiment involved two test
sessions, but not that they would be doing the exact same task twice. Given that
the stimuli concern prepositional phrases that typically occur in everyday
language use, our participants have about 20 years of linguistic experience that
contributes to their cognitive representations of these word strings. From that
viewpoint, three weeks is a relatively short time span. Furthermore, there is no
reason to assume that the use of the word combinations under investigation
changes much in these three weeks. Therefore, the interval is not expected to
bring about noticeable alterations in cognitive representations and metalinguistic
judgments regarding the stimuli.
The items were presented in an online questionnaire form (using the Qualtrics
software program) and this was also the environment within which the ratings
were given. After signing a consent form and filling out a brief questionnaire
regarding demographic variables (age, gender, language background),
participants were introduced to the notion of relative ratings through the example
of comparing the size of depicted clouds and expressing this relationship in
numbers. They were instructed to rate each stimulus relative to the immediately
preceding one, as this is what participants are inclined to do, rather than
comparing each stimulus to a fixed modulus (e.g. Sprouse 2008). In a brief
practice session, participants gave familiarity ratings to verb–object
combinations (e.g. veters strikken ‘to tie shoe laces’). Before starting the main
experiment, they were given a few tips, i.e. not to restrict their ratings to the scale
used in the Dutch grading system (1 to 10, with 10 being a perfect score), not to
assign negative numbers, and not to start very low, so as to leave room for subsequent lower
ratings.
The main experiment consisted of two blocks: one in which the PPs were
presented in isolation, and one with the PPs embedded in a sentence (with the PP
underlined). Within each block, the order of presentation was randomized for each
participant. Half of the participants started with the isolated block of items, the
other half with the items in sentence contexts. The instructions were to rate
familiarity of the word combination (“Hoe vertrouwd vind je deze combinatie van
woorden?” – ‘How familiar do you consider this combination of words?’). In
earlier studies using familiarity ratings (e.g. Blasko and Connine 1999; Juhasz and
Rayner 2003), the instructions for participants are very concise, illustrating that
the term ‘familiarity’ can be understood without much introduction. Usually,
participants are simply asked to rate how familiar they are with a stimulus on a 5- or 7-point Likert scale. When guidelines are provided, they refer to usage
frequency. Williams and Morris (2004), for instance, asked participants to rate
how often they had seen a given word. Juhasz et al. (2015, Appendix) used the
phrasing “if you feel you know the meaning of the word and use it frequently, then
give it a high rating on this scale”.
Before judging the isolated word strings, our participants were told: “If you wish,
you could think of the combination in a particular context before judging it.” Before
rating the stimuli in sentences they were informed: “You will see a word
combination in a sentence. We would like to ask you to judge the familiarity of the
underlined phrase in this specific context.” We did not verify how carefully
participants read the context. Given that the PP appeared in different positions on
the screen, participants could not keep their eyes focused on one spot. The
context consisted of just one sentence and it would have been difficult to refrain
from reading it automatically.
2.2.5 Data transformations
For each participant, the ratings provided within one session were converted to Z-scores to make comparisons of relative ratings possible. This transformation is
relatively common in acceptability judgments (Bader & Häussler 2010; Schütze &
Sprouse 2013), as it involves no loss of information on ranking, nor at the interval
level. By converting into Z-scores, a score of 0 indicates that a particular item is
judged by a participant to be of average familiarity compared to the other items.
For each item, Appendix 2.2 lists the mean of the Z-scores of all participants for
that item, and the standard deviation.
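The per-participant conversion described above can be sketched as follows. This is a minimal illustration of the standard Z-score formula, not code from the study; the function name and example ratings are ours.

```python
def z_scores(ratings):
    # Standardize one participant's ratings: subtract that participant's
    # mean and divide by the standard deviation, so that a score of 0
    # marks an item of average familiarity for that participant.
    n = len(ratings)
    mean = sum(ratings) / n
    sd = (sum((r - mean) ** 2 for r in ratings) / n) ** 0.5
    return [(r - mean) / sd for r in ratings]

# Hypothetical ratings on a participant's self-built Magnitude Estimation scale
raw = [10, 20, 30, 40, 50]
print([round(z, 2) for z in z_scores(raw)])  # → [-1.41, -0.71, 0.0, 0.71, 1.41]
```

Because each participant builds their own scale, this standardization is what makes ratings comparable across participants while preserving rank order and interval information.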
To investigate the stability in judgment, an item’s Z-score in the second
session was subtracted from its Z-score in the first session. The differences, or Δ-scores, were used to analyze the extent to which a participant rated an item
differently over time (e.g. if a participant’s rating for naar huis yielded a Z-score
of 1.0 in the first session, and 0.5 in the second, the Δ-score is 0.5; if it was 1.0
the first time, and 1.5 the second time, the Δ-score is also 0.5, as the instability of
the judgment is of the same magnitude). Absolute Δ-scores are used here, since
it is of no importance for our research questions whether the difference in scores
involves a higher or a lower score at Time 2. As participants constructed a scale
at Time 1 and a new one at Time 2, ratings were converted into Z-scores at Time
1 and Time 2 separately. Consequently, we cannot determine whether participants
might have considered all stimuli more familiar the second time. Since we used
stimuli that are common in everyday language use, we have no reason to assume
that their use and their perceived familiarity changed much within a period of two
to three weeks. In order to investigate whether ratings move in one direction or
another, we would need participants to use a fixed scale, for example a 7-point Likert
scale. For this, we refer to the follow-up study in which a fixed scale was used
(Chapter 3).
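The Δ-score computation described above amounts to an absolute difference between the two Z-scores; a minimal sketch (the function name is ours):

```python
def delta_score(z_time1, z_time2):
    # Only the magnitude of the change matters, not its direction,
    # so the absolute difference is taken.
    return abs(z_time1 - z_time2)

# The two cases from the text: 1.0 -> 0.5 and 1.0 -> 1.5 both yield 0.5
print(delta_score(1.0, 0.5))  # 0.5
print(delta_score(1.0, 1.5))  # 0.5
```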
In order to relate familiarity judgments to frequency of the rated items,
frequency counts of the exact word string in CGN were queried and subsequently
log-transformed. The same was done for the frequency of the noun (lemma
search). To give an example, the phrase naar huis occurred 1066 times in CGN,
which corresponds to a log-transformed frequency score of 2.05. The lemma
frequency of the noun, which encompasses occurrences of huizen, huisje, huisjes
in addition to huis, amounts to 4730 instances. This corresponds to a log-transformed frequency score of 2.70. Figure 2.1 shows the positions of the stimuli
on the phrase frequency scale and the lemma frequency scale; Appendix 2.2 lists
for all stimuli the raw and the log-transformed frequencies.
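The log transformation can be reproduced under the assumption that raw counts are first normalized to occurrences per million words before taking the base-10 logarithm. Note that the effective corpus size used below (9.5 million words) is our reconstruction, chosen because it reproduces the reported scores of 2.05 and 2.70; it is not a figure from the study.

```python
import math

def log_freq(count, corpus_size_words):
    # Assumption: log10 of frequency per million words. This normalization
    # reproduces the reported example values; the study does not spell it out.
    per_million = count / (corpus_size_words / 1_000_000)
    return math.log10(per_million)

CGN_SIZE = 9_500_000  # hypothetical effective size of the ~ten-million-word CGN

print(round(log_freq(1066, CGN_SIZE), 2))  # naar huis, phrase frequency → 2.05
print(round(log_freq(4730, CGN_SIZE), 2))  # huis, lemma frequency → 2.7
```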
Figure 2.1 Scatterplot of the relationship between the log-transformed corpus
frequency of the PP and that of the N (r = .59). The numbers 1 to 44
identify the individual stimuli (see Appendices).
2.2.6 Statistical analyses
First of all, we investigated to what extent the familiarity judgments can be
predicted by the log-transformed frequency of the specific phrase (LOGFREQPP)
and the log-transformed lemma-frequency of the noun (LOGFREQN), and to what
degree the factors CONTEXT and TIME (i.e. first or second session) exert influence.
The stability of the judgments was investigated in a separate analysis.
We ran linear mixed-effects models (Baayen et al. 2008), using the function
lmer from the lme4 package in the R software program (www.r-project.org). As
Baayen and Milin (2010) state, mixed models obviate the necessity of prior
averaging over participants and/or items, and thereby offer the researcher the far
more ambitious goal of modeling the individual response of a given participant to a
given item.
In the first analysis, LOGFREQPP, LOGFREQN and CONTEXT were included as fixed
effects, and so were all two-way interactions. Note that there cannot be a main
effect of TIME in this analysis, since scores were converted to Z-scores for the two
sessions separately (i.e. the mean scores at Time 1 and Time 2 were 0). In the
mixed-effects models we did include the two-way interactions of TIME and the
other factors. The fixed effects were standardized.
Participants and items were included as random effects. We incorporated a
random intercept for items and random slopes for both items and participants to
account for between-item and between-participant variation. The model does not
contain a by-participant random intercept, because after the Z-score
transformation all participants’ scores have a mean of 0 and a standard deviation
of 1. Furthermore, we excluded by-item random slopes for the factors LOGFREQPP
and LOGFREQN, because each item has only one phrase frequency and one lemma
frequency. Within these limits, a model with a full random effect structure was
constructed following Barr et al. (2013). As the model did not converge, we
excluded random slopes with the lowest variance step by step. When we obtained
a converging model, a comparison with the intercept-only model proved that the
inclusion of the by-item random slope for CONTEXT and the by-participant random
slopes for the three fixed effects and for the interactions LOGFREQPP x CONTEXT
and CONTEXT x TIME was justified by the data (χ2(17) = 875.36, p < .001).
In the second analysis, we investigated the stability of the judgments. We ran
linear mixed-effects models on the Δ-scores computed for the ratings of each
participant on each item in the two sessions (see Section 2.2.5 Data
transformations). The absolute Δ-scores indicate the extent to which a
participant’s rating for a particular item at Time 2 differs from the rating at Time
1. For each item, we have a list of 86 Δ-scores that express each participant’s
stability in the grading. In order to fit a linear mixed-effects model on the set of Δ-scores, we log-transformed them using the natural logarithm function.4
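A minimal sketch of this transformation (function and variable names are ours, purely illustrative; identical ratings would give Δ = 0, for which the logarithm is undefined — how such ties were handled is not specified here):

```python
import math

def log_delta(rating_t1, rating_t2):
    """Absolute difference between two Z-scored ratings, natural-log-transformed.

    Assumes the two ratings differ (delta > 0); a tie would raise a math error.
    """
    delta = abs(rating_t1 - rating_t2)
    return math.log(delta)

# A near-identical pair of ratings yields a strongly negative score:
print(round(log_delta(0.51, 0.53), 2))  # → -3.91
```

The example reproduces the figure quoted in the text: a Δ-score of 0.02 corresponds to a log-transformed Δ-score of -3.91.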
We analyzed the log-transformed Δ-scores using linear mixed-models.
LOGFREQPP, LOGFREQN and CONTEXT were included as fixed effects, participants
and items as random effects. The fixed effects were standardized. We included a
by-item random intercept and random slope for CONTEXT. For participants, we
included a random intercept and random slopes for LOGFREQPP, LOGFREQN and
CONTEXT. As the model did not converge, we excluded random slopes with the
lowest variance step by step. When we obtained a converging model, a
comparison with the intercept-only model proved that the inclusion of the by-subject random slopes for LOGFREQPP and CONTEXT was justified by the data
(χ2(5) = 79.28, p < .001).
2.3
Results
2.3.1 Relating familiarity judgments to frequency and context
By means of linear mixed-effects models, we investigated to what extent the
familiarity judgments can be predicted by the log-transformed frequency of the
specific phrase (LOGFREQPP) and the log-transformed lemma-frequency of the
noun (LOGFREQN), and to what degree the factors CONTEXT and TIME (i.e. first or
second session) exert influence.5 The resulting model is summarized in Table 2.1 (confidence intervals were obtained via parametric bootstrapping over 100 iterations). The variance explained by this model is 33% (R2m = .16, R2c = .33).6

Table 2.1  Estimated coefficients, standard errors, and 95% confidence intervals for the mixed-model fitted to the familiarity ratings.

                          b       SE b    95% CI
  Intercept               0.01    0.05    -0.09, 0.10
  LogFreqPP               0.46    0.07     0.34, 0.60
  LogFreqN               -0.15    0.07    -0.27, -0.04
  Context                 0.04    0.03    -0.01, 0.10
  Context x LogFreqPP    -0.05    0.03    -0.10, 0.00
  Context x LogFreqN      0.00    0.03    -0.04, 0.05
  Context x Time         -0.02    0.01    -0.03, 0.00
  LogFreqPP x Time        0.01    0.03    -0.01, 0.03
  LogFreqN x Time        -0.01    0.01    -0.03, 0.01
  LogFreqPP x LogFreqN   -0.01    0.04    -0.10, 0.07

Note. Significant effects (LogFreqPP and LogFreqN) are printed in bold.

Figure 2.2 Scatterplot of the log-transformed corpus frequency of the PP and its mean familiarity rating.

4 The absolute Δ-scores constitute the positive half of a normal distribution. Log-transforming the scores yields a normal distribution, thus complying with the assumptions of parametric statistical tests.
5 Half of the participants first rated the phrases in isolation and then rated the same phrases embedded in a sentence; the other half did it the other way around. We tested whether this affected ratings by including the factor ORDERCONTEXT in the mixed-effects models. The order of the Context-block and the No Context-block did not have a significant effect on judgments (b = 0.00; SE = 0.01; 95% CI = -0.01, 0.01).
6 R2m (Marginal R_GLMM²) represents the variance explained by fixed effects; R2c (Conditional R_GLMM²) is interpreted as variance explained by both fixed and random effects (i.e. the entire model).
First of all, the model shows an effect of LOGFREQPP. Log-transformed frequency
of the phrase in CGN significantly predicted judgments, with higher frequency
leading to higher familiarity ratings, as can be observed from Figure 2.2.
Figure 2.2 also shows certain differences between items that were presented
in a sentence (orange triangles) and items that were presented as isolated word
strings (blue dots). For low-frequency phrases, providing a context tended to
heighten the ratings; in the middle part of the frequency range, there is very little
difference between +Context and –Context items; and for high-frequency phrases,
adding a context slightly lowered the ratings. However, these differences were not
pronounced enough for the interaction between CONTEXT and LOGFREQPP to be
significant (note that the confidence interval for the CONTEXT x LOGFREQPP
interaction is [-0.10, 0.00]).
A factor that did prove to have a significant effect is LOGFREQN. Higher
frequency of the noun resulted in lower familiarity ratings for the prepositional
phrase. While significant, this effect was not as strong as that of phrase
frequency. Figure 2.3 shows the mean familiarity ratings in relation to the log-transformed frequency of the noun. Note that higher noun frequency often entails
higher phrase frequency. While the former results in lower ratings, the latter leads
to higher ratings. Since phrase frequency has a stronger effect than noun
frequency, one cannot observe a clear descending line in Figure 2.3.
Figure 2.3 Scatterplot of the log-transformed corpus frequency of the N and the
mean familiarity rating of the PP as a whole.
2.3.2 Stability of familiarity ratings
To examine the stability of the familiarity judgments, we calculated the correlation
between the ratings assigned at Time 1 and those assigned at Time 2. When
averaging over participants, the ratings are highly stable. Mean ratings were
computed for each of the 88 items at Time 1, and likewise at Time 2. The
correlation between these two sets of mean ratings is nearly perfect (Pearson’s r
= .97).
Comparisons of individual participants’ ratings at Time 1 and Time 2 show a
rather different picture. For each participant we computed the correlation between
that person’s judgments at Time 1 and that person’s judgments at Time 2. This
yielded 86 correlation scores that range from -.13 to .87, with a mean correlation
of .52 (SD = .20). This means that none of the participants is as stable in their
ratings as the aggregated ratings are, and some participants (N = 5 with
correlations < .10) show very little if any correlation with their own ratings, i.e. their
ratings at Time 2 do not correlate at all with the ratings on the same items, with
the same instructions and under the same circumstances a few weeks earlier.7
Two-thirds of the participants had self-correlation scores between .32 and .70.
Figure 2.4 shows the distribution of the correlations of our 86 participants.
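The stability measure for a single participant is simply Pearson’s r between that person’s two rating vectors. A stdlib sketch (the ratings below are invented, not taken from our data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented Z-scored ratings by one participant for the same items at T1 and T2:
t1 = [0.2, -1.1, 0.8, 1.5, -0.4]
t2 = [0.1, -0.9, 1.2, 1.1, -0.6]
print(round(pearson_r(t1, t2), 2))  # → 0.95
```

Computing this once per participant over their 88 item pairs yields the 86 self-correlation scores reported above.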
Figure 2.4 Distribution of participants’ correlation of their own ratings (Pearson’s r, Time 1 – Time 2).

7 These five participants with T1-T2 correlations < .10 stand out. We identified these participants and examined their judgments in more detail. This is discussed in relation to Figure 2.7 below.
If there are stable individual differences, participants’ ratings at Time 1 should be
more similar to their own ratings at Time 2 than to the other participants’ ratings
at Time 2.8 We compared each participant’s self-correlation to the correlation between that person’s ratings at Time 1 and the group mean at Time 2 by means of the procedure described by Field (2013: 287). For 16 participants, self-correlation was significantly higher than correlation with the group mean; for 17 participants correlation with the group mean was significantly higher than self-correlation; for 53 participants there was no significant difference between the two measures.
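We take Field’s procedure to be the standard t-test for two dependent (overlapping) correlations that share one variable, with n − 3 degrees of freedom. A sketch under that assumption, with invented input values:

```python
import math

def t_dependent_r(r_xy, r_zy, r_xz, n):
    """t-statistic (df = n - 3) for comparing two dependent correlations
    r_xy and r_zy that share the variable y; r_xz is the correlation
    between the two non-shared variables."""
    det = 1 - r_xy**2 - r_xz**2 - r_zy**2 + 2 * r_xy * r_xz * r_zy
    return (r_xy - r_zy) * math.sqrt(((n - 3) * (1 + r_xz)) / (2 * det))

# E.g. a self-correlation of .60 vs a correlation of .40 with the group
# mean, the two T2 measures correlating .50, over n = 88 items:
print(round(t_dependent_r(0.60, 0.40, 0.50, 88), 2))  # → 2.33
```

The resulting t can then be checked against the critical value for n − 3 degrees of freedom; identical correlations yield t = 0 by construction.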
In order to determine if familiarity ratings were stable for certain items more
so than for others, we used the Δ-scores (see Section 2.2.5 Data transformations
and Section 2.2.6 Statistical analyses). Figure 2.5 shows for each item the mean
log-transformed Δ-score. The lower this Δ-score, the more stable the judgments
were. A Δ-score of 0.02 (meaning very little difference between the ratings at Time
1 and Time 2) corresponds to a log-transformed Δ-score of -3.91. As can be
observed from Figure 2.5, none of the items approaches the value -3. This
indicates that none of the items elicited stable ratings from all participants.
Figure 2.5 Scatterplot of the log-transformed corpus frequency of the PP and its
mean log-transformed absolute Δ-score.
We analyzed the log-transformed Δ-scores using linear mixed-models. The resulting model is summarized in Table 2.2 (confidence intervals were obtained via parametric bootstrapping over 100 iterations). Only LOGFREQPP proved to have a significant effect. Higher phrase frequency led to less instability in judgment. The variance explained by this model is 14% (R2m = .01, R2c = .14). In comparison to the relation between frequency and judgment, the relation between frequency and instability is less strong.

8 I would like to thank an anonymous reviewer for this suggestion.
Table 2.2  Estimated coefficients, standard errors, and 95% confidence intervals for the mixed-model fitted to the log-transformed absolute Δ-scores.

                          b       SE b    95% CI
  Intercept              -0.95    0.05    -1.07, -0.84
  LogFreqPP              -0.14    0.04    -0.21, -0.07
  LogFreqN                0.04    0.04    -0.03, 0.11
  Context                -0.01    0.03    -0.05, 0.03
  Context x LogFreqPP     0.02    0.02    -0.01, 0.06
  Context x LogFreqN     -0.01    0.02    -0.05, 0.03
  LogFreqPP x LogFreqN    0.00    0.03    -0.04, 0.05

Note. Significant effects (here LogFreqPP) are printed in bold.
In sum, both phrase frequency and noun frequency proved significant predictors
of familiarity judgments. Embedding the phrases in a sentence did not have a
significant effect on the familiarity ratings. Regarding the stability of judgments
we observed that, as a group, the participants provide a very stable pattern of
familiarity ratings: the overall rankings at Time 1 and Time 2 correlate nearly
perfectly. As soon as one zooms in on individual participants, or looks at individual
items, the picture becomes less stable.
2.4
Discussion
2.4.1 Coexisting stability and instability
Our study reveals that stability is found in the average judgments; individuals’
judgments display much more variability. The picture that arises from our data is
one of two perspectives. On the one hand, there is a particularly strong correlation
between the average ratings on Time 1 and on Time 2, as well as a clear
correlation between those average familiarity ratings and log-transformed corpus
frequencies of the word strings. This is where the stability resides. On the other
hand, a large majority of the participants provided rather different ratings at Time
2 compared to Time 1; none of the participants was as stable in their ratings as
the aggregated ratings are; and no single item elicited stable ratings from all of
our participants. There is clear variation in individual ratings. While these results
may seem to be at odds, we feel that they provide a very real portrayal of
metalinguistic representations and, possibly, linguistic representations.
One way to look at our dataset is to think of it as two photo mosaics created
at different times. Each mosaic is composed of numerous little photographs – in
our case: the 7568 ratings we collected within one session (86 participants each
rated 88 items). When you zoom out, all these different elements together yield
one picture. By having the participants rate the stimuli a second time, we obtained
a second picture. From a distance, these two pictures look very similar; as you
zoom in you will notice differences. Within one picture, you see that any given part
is composed of multiple elements that differ from each other to a greater or lesser
extent (i.e. the various degrees of inter-individual variation, described in the
double-framed box in Figure 2.6). Furthermore, when you compare the two
pictures in detail, you will find that a given individual element is not exactly the
same at T1 and T2 (i.e. the various degrees of intra-individual variation, described
in the circle in Figure 2.6).
The similarity between the two pictures that is observed when you zoom out
(i.e. the stability of the average ratings visible in the near perfect T1-T2
correlation) ties in well with the idea that the language system of a speech
community appears to be quite robust. In order to ensure intelligibility and
learnability of the language, this system must not change too much –especially
in the short space of a couple of weeks, and concerning everyday linguistic units
like the prepositional phrases tested in this study.
However, for the overall picture to be stable, it is not necessary for all of the
component parts to be constants too. This is shown in the fact that no single
participant’s ratings proved to be as stable as the average ratings, and the fact
that no single item elicited stable ratings from all participants. The fact that the
overall ratings correlated nearly perfectly means that as some participants gave a higher
rating the second time, others gave a lower rating, such that on average an item’s
score remains the same. The individual variation does not entail overall instability.
Figure 2.6 Visualization of inter- and intra-individual variation by means of two
photo mosaics composed of numerous little photographs. Adapted
from Hope over Fear (2008) by C. Tsevis. Copyright holder unknown.
Retrieved
from
http://www.dripbook.com/tsevis/illustrationportfolio/barack-obamai/#288337. Adapted with permission.
2.4.2 Sources of inter- and intra-individual variation
If the overall picture is remarkably stable, what does the inter- and intra-individual
variation tell us? In the following paragraphs we explore possible causes for the
intra-individual variation from Time 1 to Time 2. One possible cause for a change
in familiarity score is a change in use: some phrases may become more familiar,
others less so. However, it is highly improbable that our participants’ use of the
prepositional phrases in question changed a lot in the course of a few weeks, let
alone that some PPs became more frequent for certain participants and less so
for others such that on average the items’ scores remained the same.
The observed variation could be noise inherent in the process of judging.
Featherston (2007) contends that each individual judgment is noisy and that
most of the differences between individuals are just error variance. Mean
judgments effectively remove this error variance, since -random- errors cancel
each other out. If you test groups, Featherston argues, you will see that groups of
respondents agree quite closely. The latter is borne out, as there is a near perfect
correlation between the average ratings on Time 1 and on Time 2. It is not self-evident, though, that all differences between individual judgments can be
considered noise. True noise is fairly random fluctuation around the group pattern.
Why then were some participants’ judgments remarkably stable (i.e. r = .87 and r
= .80 in Figure 2.4)? And why did we observe significantly less instability for high-frequency phrases compared to lower-frequency items (Table 2.2)? There seems
to be more to it than just random variance. It would be interesting to determine
how much of the variation could be considered noise. One way to examine this in
future research would be to assume an identical grammar and simulate different
amounts of noise, and then compare the results with the experimental data.
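Such a simulation could be sketched as follows: a single fixed set of underlying familiarity values (the ‘identical grammar’), measured twice with Gaussian noise, with the T1–T2 correlation as outcome. All settings here are invented placeholders:

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def simulate_stability(noise_sd, n_items=88, seed=1):
    """Correlate two noisy readings of the same underlying familiarity values."""
    rng = random.Random(seed)
    true_values = [rng.gauss(0, 1) for _ in range(n_items)]
    t1 = [v + rng.gauss(0, noise_sd) for v in true_values]
    t2 = [v + rng.gauss(0, noise_sd) for v in true_values]
    return pearson_r(t1, t2)

# More noise should yield a lower simulated T1-T2 correlation:
print(simulate_stability(0.2) > simulate_stability(1.0))  # → True
```

Sweeping the noise level and comparing the simulated T1–T2 correlations with the observed distribution (Figure 2.4) would indicate how much of the instability pure noise can account for.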
One factor that may have increased noise is task-related: some of the
instability over time may well be due to the rating scale used. According to some
researchers (e.g. Weskott & Fanselow 2011), Magnitude Estimation is more likely
to produce variance than Likert scale or binary judgment tasks, due to the
increased number of response options in ME. However, several other studies (e.g.
Bard et al. 1996; Wulff 2009; Bader & Häussler 2010) provide evidence that ME
yields reliable data, not different from those of other judgments tasks, and that
inter-participant consistency is extremely high. Leaving it to participants to
construct their own response scale is a considerable advantage of ME in
judgment tasks where the construct of interest is gradient. From a usage-based
perspective of language, this is the case for most metalinguistic judgments and
especially so for familiarity ratings. It could be argued, moreover, that this self-construal of a rating scale involves deeper, more considered processing and
evaluation of the stimulus items. If anything, that would predict stronger memory
traces and therefore a higher correspondence in ratings between Time 1 and Time
2. A follow-up study (Chapter 3), in which the use of Magnitude Estimation and a
7-point Likert scale was compared, shows slightly less intra-individual variation
for participants using ME than for those using Likert scale ratings.
In addition to the unrestricted number of response options available to a
participant, another characteristic of ME might play a role. In Magnitude
Estimation, a rating for an item is given in comparison to a previous item. In our
data collection, the order of items was randomized automatically for each
participant both at Time 1 and at Time 2. The simple fact that a participant rated
a particular item A after item B at Time 1, but after item C at Time 2 will have
influenced the rating of item A (see also Sprouse 2011). Since the software
program did not record the orders in which the stimuli were presented, we cannot
determine how much of the variance is explained by the rating a participant
assigned to the previous stimulus. However, the presentation order is unlikely to
account for the fact that there were very large differences between participants
in the stability of their ratings (see Figure 2.4). As the experiment consisted of
two sets of 44 items that were randomized, it is improbable that for certain
participants the orders at T1 and T2 were nearly identical.
Another possible source for the inter- and intra-individual variation in familiarity
ratings is that participants performed the task using different strategies. It is
unlikely that respondents were simply not paying attention. The co-presence of a
research assistant encouraged them to carry out the task attentively. What is
more, the clear correlation between ratings and corpus frequencies and the
extremely stable mean ratings are not to be expected if they had not performed
the task seriously. What may be the case is that participants took different things
into account while giving scores. Hintzman (2011) claims that relationships of
repetition, exposure duration, and recency are falsely reduced to a single
underlying process: familiarity. His critique mainly concerns recognition-memory
and free recall paradigms that test people’s ability to indicate what words were in
a given list. In our study, repetition, exposure duration, and recency within the
experimental setting were practically the same for all participants. However, one
could suspect that people who provided relatively stable judgments based their
ratings on the same considerations at Time 1 and Time 2, whereas the
participants whose judgments were unstable took into account various aspects,
including ones that differed from Time 1 to Time 2. One way to explore this is to
investigate whether the stable judgments correlate better with corpus frequencies
than the unstable judgments do.
To see whether stable people base their ratings on frequency, we plotted each
participant’s stability (ranging from -1: absolute negative correlation between
one’s ratings at Time 1 and Time 2, to +1: absolute positive correlation between
one’s ratings at Time 1 and Time 2) against the extent to which his/her scores
reflect corpus frequencies. Each circle in Figure 2.7 represents a participant. There
are five clear outliers when it comes to participants’ correlations with their own
ratings (x-axis). These five participants correlate less than .10. The correlations
between their scores and the corpus frequencies (y-axis) range from -.05 to .48.
For the other 81 participants, this correlation ranges from .21 to .66. Upon
closer inspection, only the participant with a negative self-correlation and a
negative rating-corpus frequency correlation is a true outlier. Furthermore, the
cluster of 81 participants still shows a considerable range on both axes. While
there is a relationship between the two measures (r = .29), we observe substantial
dispersal. Participants whose scores do not correlate with corpus frequencies can
still be pretty consistent and, conversely, those whose judgments are correlated
with frequency, are not necessarily consistent in their ratings.
Figure 2.7 Scatterplot of participants’ stability in their own ratings (Pearson’s r,
Time 1 – Time 2) against the correlation between their ratings and
the log-transformed frequencies of the PPs (Pearson’s r). The extent
to which a participant’s position on the x-axis is correlated with the
position on the y-axis is r = .49 when all participants are included, and
r = .29 when the five participants whose own T1-T2 correlation is less
than .20 are excluded.
In sum, although there is noise contained in a single judgment, random noise does
not seem to account for the patterns of variation in the data. Therefore, we would
like to explore the possibility that variation may be a genuine property of one’s
metalinguistic representations and ultimately one’s linguistic representations.
2.4.3 Reconsidering the nature of metalinguistic judgments
Crucially, the observed intra-individual variation within a short period of time
prompts us to reconsider interpretations of inter-individual variation. The intra-individual variation shows that for quite a few of our participants metalinguistic
judgments are not particularly stable, and thus suggests that also one’s
(meta)linguistic representations vary. Thus, at Time 1 participant Y may assign a
higher score to a particular item than participant Z does, while at Time 2 it is the
other way around. If this intra-individual variation over time reflects the genuine
dynamism of linguistic representations, the difference between participant Y’s
rating and participant Z’s rating at one point in time cannot be interpreted
straightforwardly as the difference in their linguistic representations. A more
complete and more faithful impression requires multiple measurements.
Our data show a clear correlation between mean familiarity ratings and corpus
frequencies of the stimuli — both types of scores being amalgamations of data of
different people. The log-transformed frequency of the PP as well as the noun was
found to be a significant predictor of familiarity ratings. As hypothesized, higher
phrase frequency led to higher ratings. Higher frequency of the noun resulted in
lower ratings. The more frequent the noun, the more likely it is that this noun also
occurs in other phrases that are frequently used.9 Such phrases may come to
mind when rating the stimulus. If some of them are considered more familiar, the
score assigned to the stimulus is likely to be lowered.
Interestingly, phrase frequency had an effect –albeit small– on the degree of
intra-individual variation in judgments. Higher phrase frequency led to more
stability in judgment (see Figure 2.2). As for the degree of inter-individual
variation, low-frequency phrases were found to display more variation in judgment
across participants (as evidenced by higher SDs). There are several ways to
interpret these findings. In all likelihood the use of items that are infrequent in the
corpus differs more across our participants than the use of higher-frequency
items does. An item with a low corpus frequency may be fairly common for some
people, while others virtually never use it. As a result, familiarity judgments for this
item will diverge. It could also be the case that even when actual usage frequency
is comparable across participants, low-frequency items tend to yield more
variation in judgments because people differ in the number and type of
associations and exemplars that become activated, much more so than for higher-frequency items.
9 This is substantiated by corpus frequencies taken from the Netherlandic Dutch subset of SoNaR. The log-transformed frequency per million words of the phrase in het water is 1.08. Water is a high-frequency noun (logN 2.30) that occurs in various prepositional phrases, e.g. op het water (logPP 0.47), onder water (logPP 0.83), boven water (logPP 0.82), uit het water (logPP 0.38). A similar pattern is observed with respect to in de hand (logPP 1.34). Hand (logN 2.70) occurs in various prepositional phrases, e.g. aan de hand (logPP 1.79), voor de hand (logPP 1.37), uit de hand (logPP 0.38). The noun bad, by contrast, occurs less often (logN 1.56). When used together with a preposition, the phrase in bad is the most frequent combination (logPP 0.89). Other phrases are much less frequent: uit bad (logPP -0.30), met bad (logPP -1.08).

A question awaiting further research is to what extent the degrees of inter- and intra-individual variation vary as the design of the experimental task changes (e.g. different instructions, other types of stimuli, or a task that measures immediate language processing). Possibly, individual participants’ ratings are more stable if
the stimuli cover a larger frequency range, as the differences between the stimuli
are clearer to them. It would also be interesting to investigate the effects of
including (a) phrases that don’t occur much but plausibly could occur and (b)
phrases that don’t occur and require some thought to make sense of (cf. Caldwell-Harris et al. 2012). Recall that familiarity of a word sequence is taken to rest on
frequency and similarity to other words, constructions or phrases. We would
therefore predict that phrases of type (a) are more likely to resemble phrases that
are familiar, and would receive higher ratings than phrases of type (b).
Furthermore, it remains to be seen what patterns of inter- and intra-individual
variation emerge when items other than multi-word phrases are investigated. As
we suggested before, multi-word sequences may involve more variation in usage
and perceived familiarity than single words and highly schematic constructions.
To the extent that tasks differ in the processes and knowledge they tap into, the
explanatory power of specific corpus-based measures may vary (see Wiechmann
2008 and Divjak 2016 for insightful comparisons). While phrase and noun
frequency were shown to have explanatory power for the familiarity ratings, there
is variance they could not explain, meaning that there are other variables
influencing these ratings.
In our study, we examined the factor CONTEXT, as we expected familiarity
judgments to be influenced by the number and type of usage contexts that come
to mind. More specifically, we hypothesized that the presence of a sentential
context generates an exemplar which may affect both the familiarity ratings and
the degrees of inter- and intra-individual variation. While the factor CONTEXT did not
yield a significant effect, the observed trend is in the expected direction. Most
items at the lower end of our frequency scale were rated higher when presented
with rather than without a context (see Figure 2.2). It is likely that participants
have more difficulty coming up with exemplars for low- than for higher-frequency
phrases. When the phrase is embedded in a sentence, the participant is offered
an exemplar of the item in use. If this exemplar is considered recognizable —that
is, if it activates memory traces of very similar usage— it will heighten the sense
of familiarity. In the mid- and high-frequency range, there is very little difference
between the ratings assigned to +Context and –Context items. Given that
participants have more experiences with higher-frequency phrases, it will be easier
for them to think of exemplars. As we strove to formulate prototypical contexts,
the given sentence is likely to resemble the exemplars participants were thinking
of. In those cases, adding a context does not alter judgments. With a view to the
comparability of the results of different judgment tasks and their meaningfulness
and generalizability, it could be considered reassuring that the presence or
absence of a sentential context does not appear to yield significantly different
judgments.
The fact that providing a context did not influence the degrees of inter- and
intra-individual variation is puzzling though. We predicted that adding a context
would result in less variation in judgments, as the context steers what sense is
evoked. This was not borne out by the data: making participants focus on the
same kind of usage context did not systematically reduce variation in judgment
across participants or over time. To adequately explain this observation requires
a more elaborate investigation of the factor CONTEXT. Possibly, participants differ
in how appropriate or prototypical they considered a given context to be. This may
be related to differences in their linguistic experiences, or it may involve other
factors. It would be interesting to explore effects of context in more detail by using
larger text fragments, systematically varying their prototypicality, and using a
think-aloud protocol to gain insight into participants’ associations and
considerations.
2.4.4 Conclusion
This article started by pointing out that judgments are often used as sources in
linguistic research, while, really, there is much that we do not know regarding the
reliability of these judgments. For now, our findings encourage us to continue
using judgment data, but to see these data in a slightly different light. These
judgment scores can be investigated at various levels of granularity. Aggregated
means reflect perhaps not so much an idealized speaker/hearer in the Chomskyan
sense, but the overall stability of the language system in a speech community.
The individual variation can be revealing as well; variation and instability is likely
to be a genuine characteristic of (meta)linguistic representations.
We recommend, first and foremost, that researchers reflect on and be explicit
about their object of interest: mental representations that speakers have and/or
the systematicities and patterns in a language as spoken in a speech community.
Depending on one’s research focus, confining oneself to average scores or a
single measurement may result in an incomplete and oversimplified picture.
Average scores and corpus frequencies are only adequate if your unit of analysis
is at the community level. Crucially, the cognitive linguistic framework urges
researchers to expand their investigations beyond this level. Given that people’s
representations of language are mental entities, cognitive linguists cannot restrict
themselves to aggregated data. Furthermore, from a usage-based perspective,
both inter- and intra-individual variation are core characteristics of language.
If cognitive linguists take their usage-based principles seriously, they ought to
pay more attention to variation both in their research design and in the analysis
and interpretation of their data. Therefore, multiple measurements should become
the norm rather than the exception. They are necessary in order to get a reliable
picture of the dynamism of linguistic representations. This is in keeping with
observations that the activation and processing of linguistic units can vary from
one moment to the other. As Dąbrowska (2014: 646) puts it: “(…) even the same
speaker may assemble the same utterance using different chunks on different
occasions, depending on, for example, which units have been primed by prior
discourse. This flexibility helps to explain the speed of language processing: we
save time by opportunistically using whichever chunks are most accessible at the
time of the speech event.”
Importantly, variation may be more than something that is caused by
communicative demands and affordances. It may be more than ‘noise’ in
performance disturbing our view of competence. Perhaps the underlying linguistic
representations, too, are more variable than commonly assumed. While the
dynamic nature of representations lies at the heart of usage-based approaches, it
is as yet not clear how much variability is to be expected within different time
frames, for specific and more schematic units. A better understanding of patterns
of variation will contribute to a more adequate model of linguistic representations.
At the moment, we cannot be certain to what extent the variability of
metalinguistic judgments reflects the variability of linguistic representations. Still,
even if factors such as priming and salience are the causes of the observed
variability, that momentary linguistic experience (i.e. the language that is
produced, perceived and/or judged) is taken to exert some influence on one’s
representations. How strongly and directly prior and recent experiences influence
linguistic representations, and how precisely metalinguistic judgments and
processing measures reflect these representations, are questions awaiting further
research.
Chapter 3
Abstract
In a usage-based framework, variation is part and parcel of our linguistic
experiences, and therefore also of our mental representations of language. In this
paper, we bring attention to variation as a source of information. Instead of
discarding variation as mere noise, we examine what it can reveal about the
representation and use of linguistic knowledge. By means of metalinguistic
judgment data, we demonstrate how to quantify and interpret four types of
variation: variation across items, participants, time, and methods. The data
concern familiarity ratings assigned by 91 native speakers of Dutch to 79 Dutch
prepositional phrases such as in de tuin ‘in the garden’ and rond de ingang
‘around the entrance’. Participants performed the judgment task twice within a
period of one to two weeks, using either a 7-point Likert scale or a Magnitude
Estimation scale. We explicate the principles according to which the different
types of variation can be considered information about mental representation, and
we show how they can be used to test hypotheses regarding linguistic
representations.
This chapter is based on:
Verhagen, V., Mos, M., Schilperoord, J., & Backus, A. (2019). Variation is
information: Analyses of variation across items, participants, time, and methods
in metalinguistic judgment data. Linguistics. Advance online publication.
https://doi.org/10.1515/ling-2018-0036
Acknowledgements
I thank Carleen Baas for her help in collecting the data and Martijn Goudbeek for
his helpful comments on the paper we submitted to Linguistics.
Variation is information:
Analyses of variation across items,
participants, time, and methods in
metalinguistic judgment data
3.1 Introduction
The past decades have witnessed what has been called a quantitative turn in
linguistics (Gries 2014, 2015; Janda 2013). The increased availability of big
corpora, and tools and techniques to analyze these datasets, gave major impetus
to this development. In psycholinguistics, more attention is being paid to the
practice of performing power analyses in order to establish appropriate sample
sizes, reporting confidence intervals, and using mixed-effects models to
simultaneously model crossed participant and item effects (Cumming 2014;
Baayen et al. 2008; Maxwell et al. 2008). In research involving metalinguistic
judgments, too, great changes have occurred. As Schütze and Sprouse (2013: 30) remark,
“the majority of judgment collection that has been carried out by linguists over
the past 50 years has been quite informal by the standards of experimental
cognitive science”. Theorizing was commonly based on the relatively
unsystematic analysis of judgments by few speakers (often the researchers
themselves) on relatively few tokens of the structures of interest, expressed by
means of a few response categories (e.g. “acceptable”, “unacceptable”, and
sometimes “marginal”). This practice has been criticized on various accounts
(e.g. Dąbrowska 2010; Featherston 2007; Gibson & Fedorenko 2010, 2013;
Wasow & Arnold 2005), which led to inquiries involving larger sets of stimuli, larger
numbers of participants, and/or multiple test sessions. An unavoidable
consequence is that the range of variation that is measured increases
tremendously. Whenever research involves multiple measurements, there is
bound to be variation in the data that cannot be accounted for by the independent
variables. Various stimuli instantiating one underlying structure might receive
different ratings; different people may judge the same item differently; a single
informant might respond differently when judging the same stimulus twice. A
question that then requires attention is: what to make of the variability that is
observed? In this paper, we attempt to strike a balance between variation that is
‘noise’ and variation that is information, and we attempt to lay out the principles
underlying this balance. Four types of variation will be discussed: variation across
items, variation across participants, variation across time, and variation across
assessment methods. We will explicate the principles according to which these
types of variation can be considered informative, and we will show how to
investigate this by means of a metalinguistic judgment task and corpus data.
First of all, there may be variation across items that are intended to measure
the same construct (see Cronbach 1951 on Cronbach’s alpha, H. Clark 1973 on
the language-as-fixed-effect fallacy, and Baker & Kim 2004 on Item
Response Theory and the Rasch model). If these stimuli yield different outcomes,
this could lead to a better understanding of the influence of factors other than the
independent variables under investigation. For example, acceptability judgments
may appear to be affected by lexical properties in addition to syntactic ones. More
and more researchers realize the importance of including multiple stimuli to
examine a particular construct and inspecting any possible variation across these
items (e.g. Featherston 2007; Gibson & Fedorenko 2010, 2013; Wasow & Arnold
2005).
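To make the notion of consistency across items concrete, the sketch below computes Cronbach's alpha for a small participants-by-items rating matrix. The function and the toy data are invented for illustration; they are not taken from any of the studies cited here.

```python
# Illustrative sketch (toy data): Cronbach's alpha for a set of items
# intended to measure the same construct, using the standard formula
# alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores)).
from statistics import pvariance

def cronbach_alpha(ratings):
    """ratings: one row per participant, one column per item."""
    k = len(ratings[0])                      # number of items
    items = list(zip(*ratings))              # transpose to item columns
    item_vars = [pvariance(col) for col in items]
    totals = [sum(row) for row in ratings]   # total score per participant
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

# Four participants rating three items on a 7-point scale (invented)
ratings = [
    [7, 6, 7],
    [5, 5, 6],
    [2, 3, 2],
    [4, 4, 5],
]
print(round(cronbach_alpha(ratings), 2))  # prints 0.97
```

High alpha values indicate that the items pattern together across participants; low values would signal the kind of item-level variation discussed above.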
Secondly, when an item is tested with different participants, hardly ever will
they all respond in exactly the same manner. While it has become fairly common
to collect data from a group of participants, there is no consensus on what
variation across participants signifies. The way this type of variation is
approached and the extent to which it plays a role in research questions and
analyses depends, first and foremost, on the researcher’s theoretical stance.
If one assumes, as generative linguists do, that all adult native speakers
converge on the same grammar (e.g. Crain & Lillo-Martin 1999: 9; Seidenberg
1997: 1600), and it is this grammar that one aims to describe, then individual
differences are to be left out of consideration. An important distinction, in this
context, is that between competence and performance. Whenever the goal is to
define linguistic competence, this competence can only be inferred from
performance. When people apply their linguistic knowledge –be it in spontaneous
language use or in an experimental setting– this is a process that is affected by
memory limitations, distractions, slips of the tongue and ear, etc. As a result, we
observe variation in performance. In this view, variation is caused by extraneous
factors, other than competence, and therefore it is not considered to be of interest.
In Chomsky’s (1965: 3) words: “Linguistic theory is concerned primarily with an
ideal speaker-listener, in a completely homogeneous speech-community, who
knows its language perfectly and is unaffected by such grammatically irrelevant
conditions as memory limitations, distractions, shifts of attention and interest,
and errors (random or characteristic) in applying his knowledge of the language
in actual performance.”
Featherston (2007), a proponent of this view, explicitly states that variation in
judgment data is noise inherent in the process of judging. Consequently, one
should not compare individuals’ judgments. As he puts it: “each individual brings
their own noise to the comparison, and their variance in each judgement may be
in opposite directions” (pp.284-285). As a result, individuals’ judgments seem to
differ considerably, while most of the difference is just error variance.
Featherston’s advice is to collect judgments from different participants and to
average these ratings. In this way, “the errors cancel each other out and the
judgements cluster around a mean, which we can take to be the ‘underlying’ value,
free of the noise factor” (p.284).
A rather different approach to variation between speakers can be observed in
sociolinguistics and in usage-based theories of language processing and
representation. In these frameworks, variation is seen as meaningful and
theoretically relevant. Characteristic of sociolinguistics is “the recognition that
much variability is structured rather than random” (Foulkes 2006: 649). Whereas
Featherston argues that variation is noise, Foulkes (2006: 654) makes a case for
variability not to be seen as a nuisance but as a universal and functional design
feature of language. Three waves of variation studies in sociolinguistics have
contributed to this viewpoint (Eckert 2012). In the first wave, launched by Labov
(1966), large-scale survey studies revealed correlations between linguistic
variables (e.g. the realizations of a certain phoneme, the use of a particular word)
and macro-sociological categories of socioeconomic class, sex, ethnicity, and
age. The second wave employed ethnographic methods to explore the local
categories and configurations that constitute these broader categories. The third
wave zooms in on individual speakers in particular contexts to gain insight into
the ways variation is used to construct social meaning. It is characterized by a
move from the study of structure to the study of practice, which tends to involve
a qualitative rather than quantitative approach.
A question high on the agenda is how these strands of knowledge about
variability can be unified in a theoretical framework (Foulkes 2006: 654). Usage-based approaches to language processing and cognitive linguistic
representations show great promise. As Backus (2013: 23) remarks: “a usage-based approach (…) can provide sociolinguistics with a model of the cognitive
organization of language that is much more in line with its central concerns
(variation and change) than the long-dominant generative approach was (cf.
Kristiansen & Dirven 2008).”
From a usage-based perspective, variation across speakers in linguistic
representations and language processing is to be expected on theoretical
grounds. In contrast to generative linguistics, usage-based theories hold that
competence cannot be isolated from performance; competence is dynamic and
inextricably bound up with usage. Our linguistic representations are form-meaning
pairings that are taken to emerge from our experience with language together with
general cognitive skills and processes such as schematization, categorization and
chunking (Barlow & Kemmer 2000; Bybee 2006; Tomasello 2003). The more
frequently we encounter and use a particular linguistic unit, the more it becomes
entrenched. As a result, it can be activated and processed more quickly, which, in
turn, increases the probability that we use this form when we want to express the
given message, making this construction even more entrenched. Language
processing is, thus, to a large extent driven by our accumulated linguistic
experiences, and each usage event adds to our mental representations, to a
greater or lesser extent depending on its salience.10
Given that people differ in their linguistic experiences, individual differences in
(meta)linguistic knowledge and processing are to be expected on this account.
Such variation is arguably less prominent at the level of syntactic patterns
compared to lexically specific constructions. Even though people differ in the
specific instances of a schematic construction they encounter and use, they can
arrive at comparable schematic representations. Still, even in adult native
speakers’ knowledge of the passive, a core construction of English grammar,
individual differences have been observed (Street & Dąbrowska 2014).
The role of frequency in the construction and use of linguistic representations
in usage-based theories has sparked interest in variation across speakers. Various
studies (Balota et al. 2004; Caldwell-Harris et al. 2012; Dąbrowska 2008; Street &
Dąbrowska 2010, 2014; Wells et al. 2009, to name just a few) have shown groups
of participants to differ significantly in ease and speed of processing and in the
use of a wide range of constructions that vary in size, schematicity, complexity,
and dispersion. Importantly, these differences appear to be related to differences
in people’s experiences with language.
Now, given that no two speakers are identical in their language use and
language exposure, variation is to be expected within groups of participants as well.
Street & Dąbrowska (2010, 2014), in their studies on education-related differences
in comprehension of the English passive construction, note that there are
considerable differences in performance within the group of less educated
participants, but they do not examine this in more detail. An interesting study that
does zoom in on individual speakers is Barlow’s (2013) investigation of the
speech of six White House Press Secretaries answering questions at press
conferences. While the content changes across the different samples and
different speakers, the format is the same. Barlow analyzed bigrams and trigrams
(e.g. well I think, if you like) and part-of-speech bigrams (e.g. first person
plural personal pronoun + verb). He found individual differences, not just in
the use of a few idiosyncratic phrases but in a wide range of core grammatical
constructions.
10 The importance of accumulated linguistic experiences in the construction of
cognitive representations is acknowledged in various fields of research, for
example in work on the categorization of sounds (e.g. Goudbeek et al. 2009;
Kuhl 2000).
As Barlow (2013) used multiple speech samples from each press secretary,
taken over the course of several months, he was able to examine variation
between and within speakers. He observed that the inter-speaker variability was
greater than the intra-speaker variability, and the frequency of use of expressions
by individual speakers diverged from the average. Barlow thus exemplifies one
way of investigating the third type of variation: variation across time.
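The kind of measure underlying such a comparison can be sketched in a few lines: relative bigram frequencies are computed per speech sample, and these rates can then be compared between samples from the same speaker or from different speakers. The sample text and function name below are invented for illustration.

```python
# Illustrative sketch (toy data): relative bigram frequencies per speech
# sample, the kind of per-speaker usage profile that can be compared
# between and within speakers.
from collections import Counter

def bigram_rates(tokens):
    """Map each adjacent word pair to its share of all bigrams."""
    bigrams = list(zip(tokens, tokens[1:]))
    counts = Counter(bigrams)
    total = len(bigrams)
    return {bg: n / total for bg, n in counts.items()}

sample = "well i think we can say that we can try".split()
rates = bigram_rates(sample)
print(rates[("we", "can")])  # 2 of 9 bigrams
```

Profiles computed this way for different samples of one speaker, or for samples of different speakers, can then be correlated to quantify intra- versus inter-speaker variability.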
If you collect data from a language user on a particular linguistic item at
different points in time, you may observe variation from one moment to the other.
The degree of variation will depend on the type of item that is investigated and on
the length of the interval. For various types of items there are clear indications of
change throughout one’s life, as language acquisition, attrition, and training
studies show (e.g. Baayen et al. 2017; De Bot & Schrauf 2009; N. Ellis 2002). While
this may seem self-evident with respect to neologisms, and words and phrases
that are part of a register one becomes familiar with or ceases to use, change has
also been observed for other aspects of language. Eckert (1997) and Sankoff
(2006), for instance, describe how speakers' patterns of phonetic variation can
continue to change throughout their lifetime.
The use of a linguistic item by a single speaker may also vary within a much
shorter time frame. Case studies involving relatively spontaneous speech, as well
as large-scale investigations involving elicited speech, demonstrate an array of
structured variation available to an individual speaker. This variation is often
related to stylistic aspects, audience design, and discourse function. Labov (2001:
438-445) describes how the study of the speech of one individual in a range of
situations shows clear differences in the vowels’ formant values depending on
the setting. Sharma (2011) compares two sets of data from a young British-born
Asian woman in Southall: data from a sociolinguistic interview and self-recorded
interactional data covering a variety of communicative settings. Sharma reports
how the latter, but not the former, revealed strategically ‘compartmentalized’
variation. The informant was found to use a flexible and highly differentiated
repertoire of phonetic and lexical variants in managing multiple community
memberships. The variation observed may follow from deliberate choices, as well
as automatic alignment mechanisms (Garrod & Pickering 2004).
Variation within a short period of time need not always involve differences in
style and setting. Sebregts (2015) reports on individual speakers varying between
different realizations of /r/ within the same communicative setting and the same
linguistic context. He conducted a large-scale investigation into the sociophonetic,
geographical, and linguistic variation found with Dutch /r/.11 In 10 cities in the
Netherlands and Flanders, he asked approximately 40 speakers per city to
perform a picture naming task and to read aloud a word list. The tasks involved
43 words that represent different phonological contexts in which /r/ occurs.
Sebregts observed interesting patterns of variation between and within
participants. In each of the geographical communities, there were differences
between the individual speakers, some of them realizing /r/ in a way that is
characteristic of another community. Furthermore, speaker-internal variation was
found to be high. In part, this variation was related to the phonological
environment in which /r/ appeared. In addition, participants seemed to have
different variants at their disposal for the realization of /r/ in what were essentially
the same contexts. Some Flemish speakers, for example, alternated between
alveolar and uvular r within the same linguistic context, in the course of a five-minute elicitation task.
As Sebregts made use of two types of tasks –picture naming and word list
reading– he examined not just variation across items, participants, and time, but
also possible variation across methods. In his study, there were no significant
differences in speakers' performance between the two tasks. His tasks thus
yielded converging evidence: the results obtained via one method were confirmed
by those collected in a different way. This increases the reliability of the findings.
Had there been differences, these would have been at least as important and interesting.
Different types of data may display meaningful differences as they tap into
different aspects of language use and linguistic knowledge. Methods can thus
complement each other and offer a fuller picture (e.g. Chaudron 1983; Flynn 1986;
Nordquist 2009; Schönefeld 2011; Kertész et al. 2012).
A growing number of studies combine various kinds of data (see Arppe et al.
2010; Gilquin & Gries 2009; Hashemi & Babaii 2013 for examples and critical
discussions of the current practices). Some investigations make use of
fundamentally different types of data. For instance, quantitative data can be
complemented with qualitative data, to gain an in-depth understanding of
particular behavior. An often-used combination is that of corpus-based and
experimental evidence, to investigate how frequency patterns in spontaneous
speech correlate with processing speed or metalinguistic judgments (e.g. Mos et
al. 2012). Alternatively, two versions of the same experimental task can be
administered, to assess possible effects of the design. For example, participants
may be asked to express judgments on different kinds of rating scales (e.g. a
binary scale, a Likert scale, and an open-ended scale constructed in Magnitude
Estimation), to see whether the scales differ in perceived ease of use and
expressivity, and in the judgment data they provide (e.g. Bader & Häussler 2010;
Langsford et al. 2018; Preston & Colman 2000).
11 Note that the /r/ sound may be more naturally variable than many other sounds.
As Sebregts (2015: 1) remarks: “The realisation of /r/ in Dutch is a particularly
striking example of multidimensional variability”.
In sum, there are various indications that there is meaningful variation in the
production and perception of language, and that this variation can inform theories
on language processing and linguistic representations. We will demonstrate how
to measure the different types of variation, and how to determine which variation
can be considered informative. We do this by investigating metalinguistic
judgments in combination with corpus frequency data. Judgment tasks form an
often-used method in linguistics. They enable researchers to gather data on
phenomena that are absent or infrequent in corpora. Furthermore, in comparison
to psycholinguistic processing data, untimed judgments have the advantage of
hardly being affected by factors like sneezing, a lapse of attention, or unintended
distractions, as participants have ample time to reflect on the stimuli. This is not
to say that untimed judgments are not subject to uncontrolled or uncontrollable
factors at all (see for instance Birdsong 1989: 62-68), but they can form a valuable
complement to time-pressured performance data (e.g. R. Ellis 2005). Another
advantage is that it is relatively easy and cheap to conduct a judgment task with
large numbers of participants. It is therefore not surprising that countless
researchers make use of judgment data in the investigation of phenomena
ranging from syntactic patterns (e.g. Keller & Alexopoulou 2001; Meng & Bader
2000; Sorace 2000; Schütze 1996; Sprouse & Almeida 2012; Theakston 2004) to
formulaic language (e.g. N. Ellis & Simpson-Vlach 2009), collocations and
constructions (Granger 1998; Gries & Wulff 2009). Nonetheless, not much is
known about the degrees of variation in judgments – especially the variation
across participants and across time, and the extent to which this is influenced by
the design of the task. Typically, participants complete a judgment task just once,
and the reports are confined to mean ratings, averaging over participants. Some
studies (e.g. Langsford et al. 2018) do examine test-retest reliability of judgments
expressed on various scales, thus examining variation across time and across
methods, but all analyses are performed on mean ratings. We will demonstrate
how all four types of variation can be investigated in judgment data, and how they
can be used as sources of information.
3.2 Outline of the present research
To investigate variation in judgments across items, participants, time, and
methods, we had native speakers of Dutch rate the familiarity of prepositional
phrases such as in de tuin (‘in the garden’) and rond de ingang (‘around the
entrance’) twice within the space of one to two weeks, using either Magnitude
Estimation or a 7-point Likert scale. While all phrases could potentially be used in
everyday life, they differ in the frequency with which they occur in Dutch corpora,
covering a large range of frequencies (see Section 3.3.3). The frequency of
occurrence of such word sequences has been shown to affect the speed with
which they are recognized and produced (e.g. Arnon & Snider 2010; Tremblay &
Tucker 2011; Chapter 4), and we expect usage frequency to be reflected in
familiarity ratings (cf. Balota et al. 2001; Popiel & McRae 1988; Shaoul et al. 2013).
Given the gradual differences in frequency of occurrence between items, the
familiarity judgments are likely to exhibit gradience as well. As we are interested
in individual differences, we opted for two rating scales that allow individual
participants to express such gradience (see Langsford et al. 2018 for a
comparison of Likert and Magnitude Estimation scales with forced choice tasks
that require averaging over participants; see Colman et al. 1997 for a comparison
of data from 5- and 7-point rating scales).
By contrasting the degree of variation across participants with the degree of
variation within participants, we can gain insight into the extent to which variation
across speakers is meaningful. Participants perform the same judgment task
twice within a time span short enough for the construct that is being tested not
to have changed much, yet long enough for the respondents not to be able to
recall the exact scores they assigned the first time. If each individual’s judgment
is fairly stable, while there is consistent variation across participants, then this
shows that there are stable differences between participants in judgment. If
individuals’ judgments are found to vary from one moment to the other, this gives
rise to another important question: Does this mean that judgments are
fundamentally noisy, or is the variability a genuine characteristic of people’s
cognitive representations, one that needs to be investigated and accounted for?
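The contrast at stake here can be made concrete with a small numerical sketch: the spread of ratings across participants (per item) versus each participant's own test-retest consistency between the two sessions. All numbers below are invented toy data; this illustrates the logic of the comparison, not the study's actual analysis.

```python
# Illustrative sketch (toy data): across-participant spread per item
# versus within-participant test-retest consistency.
from statistics import mean, pstdev

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

# session1[p][i] = participant p's rating of item i (invented 7-point data)
session1 = [[7, 5, 2], [6, 4, 1], [7, 3, 3]]
session2 = [[7, 4, 3], [6, 5, 1], [7, 3, 2]]

# Across-participant variation: SD of ratings per item at time 1
across = [pstdev(col) for col in zip(*session1)]

# Across-time stability: each participant's test-retest correlation
within = [pearson(s1, s2) for s1, s2 in zip(session1, session2)]

print([round(s, 2) for s in across], [round(r, 2) for r in within])
```

Stable, high test-retest correlations combined with non-trivial across-participant spread would indicate consistent individual differences; low test-retest correlations would point toward the noise interpretation.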
In disciplines other than linguistics, there is plenty of research taking rating
scale measurements several days, weeks, or months apart (see, for instance,
Ashton 2000; Churchill & Peter 1984; Jiang & Cillessen 2005; Paiva et al. 2014;
VanGeest et al. 2002). In linguistics, too, there are a number of studies in which
participants performed (part of) a judgment task twice, some of which show
judgments to be unstable (e.g. Birdsong 1989; R. Ellis 1991; Johnson et al. 1996;
Tabatabaei & Dehghani 2012). Most of this research has been conducted with
second language learners. Important to note is that these studies offered few
response options (either binary, or acceptable/unacceptable/unsure), and the
stimuli consisted of sentences. This likely influences the stability of the
judgments. A binary response scale may not fit well with people’s perceptions of
acceptability. As Birdsong (1989: 166) puts it: “Not all grammatical sentences are
perceived as equally ‘good’, and not all ungrammatical sentences are perceived
as equally ‘bad’” (also see Wasow & Arnold 2005). If you consider a stimulus to
be of medium acceptability, it is not surprising that you will classify it as
acceptable on one occasion and as unacceptable on another. It has been argued
that more than three response options are needed to achieve stable participant
responses (Preston & Colman 2000; Weng 2004). Furthermore, in the majority of
the test-retest studies participants were asked to judge sentences. If language
users do not store representations of entire sentences, it may be harder to assess
them in the exact same way on different occasions. Consequently, these studies
do not answer the question how much variation is to be expected when adult
native speakers perform the same metalinguistic judgment task twice within a
couple of weeks, rating phrases that may be used in everyday life on a scale that
allows for more fine-grained distinctions.
The set-up of our study enabled us to compare the variation across
participants with the variation across time, and to relate each of these to corpus-based frequencies of the phrases. In addition, we examined variation across
methods. To be precise, we measured the four types of variation discussed in
Section 3.1 and used those to test four hypotheses regarding linguistic
representations and metalinguistic knowledge and to answer an as yet open
question with respect to the variation across rating methods.
Hypothesis I
Variation across items correlates with corpus frequencies
Rated familiarity indexes the extent and type of previous experience someone has
had with a given stimulus (Gernsbacher 1984; Juhasz et al. 2015). If you are to
judge the familiarity of a word string, your assessment is taken to rest on
frequency and similarity to other words, constructions, or phrases (Bybee 2010:
214). Therefore, participants’ ratings are expected to correlate with corpus
frequencies – not perfectly, though, since a corpus is not a perfect representation
of an individual participant’s linguistic experiences. So, the first hypothesis will be
borne out if variation across items is found that can be predicted largely from the
independent variable: corpus frequencies.
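The test that Hypothesis I calls for can be sketched as follows, with invented numbers: mean familiarity ratings per phrase are correlated with log-transformed corpus frequencies. The log transform is a common choice for frequency data in this kind of analysis, not necessarily the transformation used in this study, and the figures below are toy values.

```python
# Minimal sketch (toy data): correlating mean familiarity ratings with
# log-transformed corpus frequencies of the phrases.
from math import log
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

freqs = [12000, 3400, 150, 40, 6]     # corpus frequency per phrase (invented)
ratings = [6.8, 6.1, 4.9, 3.7, 2.2]   # mean familiarity per phrase (invented)

r = pearson([log(f) for f in freqs], ratings)
print(round(r, 2))
```

A high but imperfect correlation is the expected pattern: corpus frequency approximates, but does not equal, any individual's linguistic experience.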
Hypothesis II
Variation across participants is smaller for high-frequency
phrases than for low-frequency phrases
The more frequent the phrase, the more likely that it is known to many people.
The use of words tends to be ‘bursty’: when a word has occurred in a text, you
are more likely to see it again in that text than if it had not occurred (Altmann et
al. 2011; Church & Gale 1995). The occurrences of stimuli with low corpus
frequencies are likely to be clustered in a small number of texts. As such, they
may be fairly common for some people, while others virtually never use them.
Consequently, familiarity ratings for these phrases will differ more across
participants.
Hypothesis III
Variation across time is smaller for high-frequency phrases than
for low-frequency phrases
In judging familiarity, a participant will activate potential uses of a given stimulus.
The number and kinds of usage contexts and the ease with which they come to
mind influence familiarity judgments. The item’s frequency may affect the ease
with which exemplars are generated. For low-frequency phrases, the number and
type of associations and exemplars that become activated are likely to differ more
from one moment to the other, resulting in variation in judgments across time.
Hypothesis IV
The variation across participants is larger than the variation
across time
For this study’s set of items and test-retest interval, the variation in judgment
across participants is expected to be larger than the variation within one person’s
ratings across time. As the phrases may be used in everyday life, the raters had
at least 18 years of linguistic experiences that have contributed to their familiarity
with these word strings. From that viewpoint, two weeks is a relatively short time
span, and there is no reason to assume that the use of the word combinations
under investigation, or participants’ mental representations of these linguistic
units, changed much in two weeks.
Question
To what extent is there variation across rating methods?
As for possible variation across rating methods, different hypotheses can be
formulated. Magnitude Estimation (ME) differs from Likert scales in that it offers
distinctions in ratings that are as fine-grained as participants’ capacities allow
(Bard et al. 1996). Participants create their own scale of judgment, rather than
being forced to use a scale with a predetermined, limited number of values of
which the (psychological) distances are unknown. According to some researchers
(e.g. Weskott & Fanselow 2011), Magnitude Estimation is more likely to produce
large variance than Likert scale or binary judgment tasks, due to the increased
number of response options. However, several other studies (e.g. Bader &
Häussler 2010; Bard et al. 1996; Wulff 2009) provide evidence that Magnitude
Estimation yields reliable data, not different from those of other judgment tasks,
and that inter-participant consistency is extremely high.
One could even argue that judgments expressed by means of Magnitude
Estimation will display less variation across time than Likert scale ratings. As ME
allows participants to distinguish as many degrees of familiarity as they feel
relevant, there is likely to be a better match between perceived familiarity and the
ratings one assigns (cf. Preston & Colman 2000). A participant may have the
feeling that the level of familiarity of an item corresponds to 4.5 on a 7-point scale,
but this is not a valid response option on this scale. It is entirely possible that
this participant then rates the item as 4 on one occasion and as 5 on another
occasion. If participants are free to choose the number of degrees that are
distinguished, they can assign the rating 4.5 on both occasions. Moreover, the
self-construal of a rating scale may involve more conscious processing and
evaluation of the stimulus items. This could lead to stronger memory traces and
therefore a higher correspondence in ratings across time.
3.3 Method
3.3.1 Design
In order to examine degrees of variation in familiarity judgments for prepositional
phrases with a range in frequency, and the influence of using a Likert vs a
Magnitude Estimation scale, a 2 (Time) x 2 (RATINGSCALE) design was used: 91
participants rated 79 items twice within the space of one to two weeks. As can
be observed from Table 3.1, half of the participants gave ratings on a 7-point Likert
scale at Time 1; the other half used Magnitude Estimation. At Time 2, half of the
participants used the same scale as at Time 1, and the other half was given a
different scale. This allowed us to investigate variation across items, across
participants, across time, and across methods.
Table 3.1	The number of participants that took part in the four experimental
conditions.

Rating scale at Time 1    Rating scale at Time 2    Participants (N)
Likert                    Likert                    24
Likert                    Magnitude Estimation      22
Magnitude Estimation      Likert                    22
Magnitude Estimation      Magnitude Estimation      23
3.3.2 Participants
The group of participants consisted of 91 persons (63 female, 28 male), mean
age 27.1 years (SD = 11.9, age range: 18 - 70). The four conditions did not differ
in terms of participants' age (F(3, 87) = 0.20, p = .89) or gender (χ²(3) = 1.83,
p = .63). All participants were native speakers of Dutch. A large majority (viz. 82
participants) had a tertiary education degree; 9 participants had had intermediate
vocational education. Educational background did not differ across conditions
(χ²(6) = 3.57, p = .73).
3.3.3 Stimulus items
Participants were asked to rate 79 Prepositional Phrases (PPs) consisting of a
preposition and a noun, and in a majority of the cases an article (i.e. 52 phrases
with the definite article de; 16 with the definite article het; 11 without an article).
The items cover a wide range of frequencies (from 1 to 14,688) in a subset of the
corpus SoNaR consisting of approximately 195.6 million words.12 The phrases
and the frequency data can be found in Appendices 3.1 and 3.2.
The word strings were presented in isolation. Since all stimuli constitute
phrases by themselves, they form a meaningful unit even without additional
context. In a previous study into the stability of Magnitude Estimation ratings of
familiarity (Chapter 2), we investigated possible effects of context by presenting
prepositional phrases both in isolation and embedded in a sentence. The factor
CONTEXT did not have a significant effect on familiarity ratings, nor on the degrees
of variation across and within participants.
3.3.4 Procedure
The items were presented in an online questionnaire (built with the Qualtrics
software), and this was also the environment in which the ratings were given.
The experiment was conducted via the internet.13 Participants
received a link to a website. There they were given more information about the
study and they were asked for consent. Subsequently, they were asked to provide
some information regarding demographic variables (age, gender, language
background, educational background). After that, it was explained that their task
was to indicate how familiar various word combinations are to them. In line with
earlier studies using familiarity ratings (Juhasz et al. 2015; Williams & Morris
2004), our instructions read that the more you use and encounter a particular
word combination, the more familiar it is to you, and the higher the score you
assign to it.
In the Likert scale condition, participants were presented with a prepositional
phrase together with the statement ‘This combination sounds familiar to me’
(Deze combinatie klinkt voor mij vertrouwd) and a 7-point scale, the endpoints of
which were marked by the words ‘Disagree’ and ‘Agree’ (Oneens and Eens).
Participants were shown one example. After that, the experiment started.
12
SoNaR is a balanced reference corpus of contemporary written standard Dutch
(Oostdijk et al. 2013). The subset we used consists of texts originating from the
Netherlands (143.8 million words) and texts originating either from the
Netherlands or Belgium (51.8 million words).
13
Balota et al. (2001) found that familiarity ratings from a web-based task were
strongly correlated with ratings from laboratory tasks.
Variation is information
51
When participants were to use Magnitude Estimation, they were first
introduced to the notion of relative ratings through the example of comparing the
size of depicted clouds and expressing this relationship in numbers. In a brief
practice session, participants gave familiarity ratings to word combinations that
were not prepositional phrases (e.g. de muziek klinkt luid ‘the music sounds
loud’). Before starting the main experiment, they were advised not to restrict
their ratings to the scale used in the Dutch grading system (1 to 10, with 10
being a perfect score), not to assign negative numbers, and not to start very
low, so as to allow for subsequent lower ratings. At the start of the experiment,
participants rated the phrase tegen de avond (‘towards the evening’). This phrase
was taken from the middle region of the frequency range, as this may stimulate
sensitivity to differences between items with moderate familiarity (Sprouse
2011). Then, they compared each successive stimulus to the reference phrase
(‘How do you rate this combination in terms of familiarity when comparing it with
the reference combination?’ Hoe scoort deze combinatie op vertrouwdheid
wanneer je deze vergelijkt met de referentiecombinatie?).
The stimuli were randomized once. The presentation order was the same for
all participants, in both sessions, to ensure that any differences in judgment are
not caused by differences in stimulus order (cf. Sprouse 2011). Midway,
participants were informed that they had completed half of the task, and they were
offered the opportunity to enter remarks and questions, as they were again at the
end of the task.
All participants completed the experiment twice, with a period of one to two
weeks between the first and second session. They knew in advance that the
investigation involved two test sessions, but not that they would be doing the
same task twice. The time interval ranged from 4 to 15 days (M = 7, SD = 3.11).
The four experimental conditions did not differ in terms of time interval (F(3, 87)
= 0.28, p = .84). After four days, people are not expected to be able to recall the
exact scores they assigned to each of the 79 stimuli.
3.3.5 Data transformations
For each participant, the ratings provided within one session were converted into
Z-scores to make comparisons of judgments and variation possible. After this
conversion, a Z-score of 0 indicates that a participant judged a particular item
to be of average familiarity compared to the other items. For each item, Appendix
3.2 lists the mean of the Z-scores of all participants for that item, and the standard
deviation. The Z-score transformation is common in judgment studies (Bader &
Häussler 2010; Schütze & Sprouse 2013), as it involves no loss of information on
ranking, nor at the interval level. It does entail the loss of information about
absolute familiarity and developments in absolute familiarity over time that is
present in the data from the Likert scale condition. However, absolute familiarity
is of secondary importance in this study. A direct comparison of the different
response variables, on the other hand, is at the heart of the matter, and the use of
Z-scores enables us to make such a comparison. To assess the consequences of
using Z-scores, we also performed all analyses using raw instead of standardized
Likert scores, applying mixed ordinal regression to the Likert scale data, and linear
mixed-effects models to the ME data. This did not yield substantially different
findings. We will come back to differences between Likert and ME ratings, and
advantages and disadvantages of each of those, in the discussion (Section 3.5).
To investigate variation across time, a participant’s Z-score for an item in the
second session was subtracted from the Z-score in the first session. The absolute
difference (i.e. Δ-score) provides insight into the extent to which a participant rated an item
differently over time (e.g. if a participant’s rating for naar huis yielded a Z-score
of 1.0 in the first session, and 0.5 in the second, the Δ-score is 0.5; if it was 1.0
the first time, and 1.5 the second time, the Δ-score is also 0.5, as the variation
across time is of the same magnitude). Given that participants who used
Magnitude Estimation constructed a scale at Time 1 and a new one at Time 2,
ratings had to be converted into Z-scores at Time 1 and Time 2 separately.
Consequently, we cannot determine whether participants might have considered
all stimuli more familiar the second time (something which will be addressed in
Section 3.5).
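The two transformations described above can be sketched in a few lines of Python (a minimal illustration of the procedure, not the actual analysis script; the function names are ours):

```python
from statistics import mean, stdev

def z_scores(ratings):
    """Standardize one participant's ratings from a single session."""
    m, sd = mean(ratings), stdev(ratings)
    return [(r - m) / sd for r in ratings]

def delta_scores(z_time1, z_time2):
    """Absolute per-item difference between the Z-scores of the two
    sessions; direction is ignored, only the magnitude of change counts."""
    return [abs(z1 - z2) for z1, z2 in zip(z_time1, z_time2)]

# Worked example from the text: Z-scores of 1.0 vs 0.5 and of 1.0 vs 1.5
# both yield a delta-score of 0.5.
print(delta_scores([1.0, 1.0], [0.5, 1.5]))  # → [0.5, 0.5]
```

Because ME participants constructed a fresh scale in each session, standardization is applied to each session separately before the Δ-scores are computed.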
In order to relate variation in judgments to frequency of the phrases, frequency
counts of the exact word string in the SoNaR-subset were queried and the
frequency of occurrence per million words in the corpus was logarithmically
transformed to base 10. The same was done for the frequency of the noun
(lemma search).14 To give an example, the phrase naar huis occurred 14,688
times, which corresponds to a log-transformed frequency score of 1.88. The
lemma frequency of the noun, which encompasses occurrences of huizen, huisje,
and huisjes in addition to huis, amounts to 84,918 instances. This corresponds to
a log-transformed frequency score of 2.64. Figure 3.1 shows the positions of the
stimuli on the phrase frequency scale and the lemma frequency scale; Appendix
3.2 lists for all stimuli the raw and the log-transformed frequencies. As can be
observed from Figure 3.1, for low-frequency PPs, the frequency of the noun varies
considerably (compare, for example, items 10 and 12). A high noun frequency (as
in item 12) indicates that the noun also occurs in phrases other than the one
we selected as a stimulus. Such phrases may come to mind when rating the
stimulus. If some of them are considered more familiar, the score assigned to the
stimulus is likely to be lowered. The high-frequency phrases in our stimulus set
have fewer ‘salient competitors’: they tend to be the most common phrase
comprising the given noun. Consider as an example the noun bad (‘bath’,
LOGFREQN 1.52). When used together with a preposition, the phrase in bad (item
54) is the most frequent combination (LOGFREQPP 0.81). Other phrases are much
less frequent: uit bad (LOGFREQPP -0.38), met bad (LOGFREQPP -1.18).

14
Knowledge about the patterns of co-occurrence of linguistic elements is part of
our mental representations of language. Such knowledge is taken to inform
familiarity judgments. It also enables us to generate expectations, which in turn
affect the effort it takes to process the subsequent input (Huettig 2015). Word
predictability is commonly expressed by means of the metrics entropy (which
expresses the uncertainty at position t about what will follow) and surprisal
(which expresses how unexpected the actually perceived word wt+1 is), estimated
by language models trained on text corpora (Levy 2008). Entropy and surprisal
have been used successfully in models that predict speed and ease of processing
(e.g. Baayen et al. 2011; Linzen & Jaeger 2016). These metrics are not taken into
account in the present study, as we do not examine processing costs. We do so
in another paper, in which we examine individual differences in experiences,
expectations, and processing speed (Chapter 4).
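The frequency transformation can be reproduced with a few lines of Python (a sketch assuming the 195.6-million-word SoNaR subset described above; the function name is ours):

```python
from math import log10

CORPUS_SIZE_MILLIONS = 195.6  # size of the SoNaR subset in millions of words

def log_freq(raw_count):
    """Log10-transformed frequency per million corpus words."""
    return log10(raw_count / CORPUS_SIZE_MILLIONS)

# naar huis: 14,688 phrase tokens; lemma huis: 84,918 tokens
print(round(log_freq(14688), 2))  # → 1.88
print(round(log_freq(84918), 2))  # → 2.64
```

Note that an item occurring only once comes out at log_freq(1) ≈ -2.29, so low-frequency phrases receive negative scores on this scale (cf. met bad at -1.18).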
Figure 3.1	Scatterplot of the relationship between the log-transformed corpus
frequency per million words of the PP and that of the N (r = .39). The
numbers 1 to 79 identify the individual stimuli (see Appendices).
3.3.6 Statistical analyses
Using linear mixed-effects models (Baayen et al. 2008), we investigated to what
extent the familiarity judgments can be predicted by corpus frequencies, and
whether this differs per session and/or per rating scale. Mixed-effects models obviate the
necessity of prior averaging over participants and/or items, enabling the
researcher to model the individual response of a given participant to a given item
(Baayen et al. 2008). Appendix 3.3 describes our implementation of this statistical
technique (i.e. fixed effects, random effects structures, estimation of confidence
intervals). If the resulting model shows that frequency has a significant effect,
this is in line with our first hypothesis, which states that there is variation across
items in familiarity ratings that can be predicted largely from corpus frequencies.
We used standard deviation as a measure of variation across participants.
Plotting the standard deviations against the stimuli’s corpus frequencies, we
examined whether there is a relationship between phrase frequency and the
variation in judgment across participants. We hypothesized that high-frequency
phrases display less variation across participants than low-frequency phrases.
Variation across time was investigated in two ways. First, we inspected the
extent to which the judgments at Time 2 correlate with the judgments at Time 1,
by calculating the correlation between a participant’s Z-scores across sessions.
The Z-scores preserve information on ranking and on the intervals between the
raw scores. High correlation scores thus indicate that there is little variation
across time in these respects. Subsequently, we ran linear mixed-effects models
on the Δ-scores, to determine which factors influence variation across time. As
described in Section 3.3.5, the Δ-scores quantify the extent to which a
participant’s rating for a particular item at Time 2 differs from the rating at Time
1. The details of the modeling procedure are also described in Appendix 3.3. In
order for our third hypothesis to be confirmed, phrase frequency should prove to
have a significant negative effect, such that higher phrase frequency entails less
variation in judgment across time.
Then we compared the variation within participants across time with the
variation across participants. The latter was hypothesized to be larger than the
former. If that is the case, participants’ ratings at Time 1 should be more similar
to their own ratings at Time 2 than to the other participants’ ratings at Time 2. To
test this, we compared each participant’s self-correlation to the correlation
between that person’s ratings at T1 and the group mean at T2, by means of the
procedure described by Field (2013: 287).15 If the latter is significantly higher than
the former, the fourth hypothesis is confirmed.
15
Field (2013: 287) describes how one can test by means of a t-statistic (Chen &
Popovich 2002) whether a difference between two dependent correlations from
the same sample is significant. To test whether the relationship between a
participant’s scores at Time 2 (x) and that participant’s scores at Time 1 (y) is
stronger than the relationship between the group mean at Time 2 (z) and that
participant’s scores at Time 1 (y), the t-statistic is computed as:

tDifference = (rxy – rzy) · √[((n – 3)(1 + rxz)) / (2(1 – r²xy – r²xz – r²zy + 2·rxy·rxz·rzy))]

The resulting value is checked against the appropriate critical values. For a two-tailed
test with 76 degrees of freedom, the critical values are 1.99 (p < .05) and
2.64 (p < .01).
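The formula in footnote 15 translates directly into code. The sketch below uses our own naming, and the correlation values are illustrative rather than results from the study; the correlations are computed over the 79 items, so n = 79 and df = n − 3 = 76:

```python
from math import sqrt

def t_dependent_correlations(r_xy, r_zy, r_xz, n):
    """t-statistic for the difference between two dependent correlations
    that share the variable y (Chen & Popovich 2002; Field 2013: 287)."""
    numerator = (n - 3) * (1 + r_xz)
    denominator = 2 * (1 - r_xy**2 - r_xz**2 - r_zy**2
                       + 2 * r_xy * r_xz * r_zy)
    return (r_xy - r_zy) * sqrt(numerator / denominator)

# Illustrative values: self-correlation .80 (x = own ratings at T2,
# y = own ratings at T1), correlation with the group mean .60
# (z = group mean at T2), and a correlation of .70 between x and z.
t = t_dependent_correlations(0.80, 0.60, 0.70, n=79)
print(round(t, 2))  # → 3.77, beyond the critical value 2.64 (p < .01)
```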
In order to ascertain to what extent there is variation across rating methods,
we examined the role of the factor RATINGSCALE in the linear mixed-effects models,
and the extent to which the patterns in the standard deviations as well as the
Time1–Time2 correlations vary depending on the rating scale that is used. To
conclude that the scales yield different outcomes, the standard deviations and
correlation scores should be found to differ across methods, and/or the factor
RATINGSCALE should prove to have a significant effect, or enter into an interaction
with another factor, in the mixed-models.
3.4 Results
3.4.1 Relating familiarity judgments to corpus frequencies and rating scale
Participants discerned various degrees of familiarity. In the Likert scale conditions,
participants could distinguish maximally seven degrees. On average, they
discerned 6.4 degrees of familiarity (Likert Time 1: M = 6.3, SD = 1.2, range: 2-7;
Likert Time 2: M = 6.5, SD = 1.0, range: 2-7). In the Magnitude Estimation
conditions, participants could determine the number of response options
themselves. On average, they discerned 12.0 degrees of familiarity (ME Time 1:
M = 12.6, SD = 6.3, range: 3-35; ME Time 2: M = 11.4, SD = 4.4, range: 3-22).
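The number of degrees discerned is simply the number of distinct rating values a participant used over the 79 items, which can be counted as follows (toy ratings, ours):

```python
def degrees_discerned(ratings):
    """Number of distinct rating values a participant used."""
    return len(set(ratings))

# A Likert participant is capped at 7 distinct values; an ME participant
# constructs their own scale and may use many more.
likert_ratings = [1, 4, 4, 6, 7, 7, 5, 3]
me_ratings = [10, 25, 25, 40, 55, 70, 80, 100]
print(degrees_discerned(likert_ratings))  # → 6
print(degrees_discerned(me_ratings))      # → 7
```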
From a usage-based perspective, perceived degree of familiarity is determined
to a large extent by usage frequency, which can be gauged by corpus frequencies.
By means of linear mixed-effects models, we investigated to what extent the
familiarity judgments can be predicted by the frequency of the specific phrase
(LOGFREQPP) and the lemma-frequency of the noun (LOGFREQN), and to what
degree the factors RATINGSCALE (i.e. Likert or Magnitude Estimation), Time (i.e.
first or second session), and the order in which the items were presented exert
influence. We incrementally added predictors and assessed by means of
likelihood ratio tests whether or not they significantly contributed to explaining
variance in familiarity judgments. A detailed description of this model selection
procedure can be found in Appendix 3.3. The interaction term LOGFREQPP x
LOGFREQN did not contribute to the fit of the model. Furthermore, none of the
interactions of Time and the other variables was found to improve goodness-of-fit. As for PRESENTATIONORDER, only the interaction with RATINGSCALE contributed to
explaining variance. The resulting model is summarized in Table 3.2. The variance
explained by this model is 57% (R²m = .36, R²c = .57).16

16
R²m (marginal R² coefficient) represents the amount of variance explained by
the fixed effects; R²c (conditional R² coefficient) is interpreted as variance
explained by both fixed and random effects (i.e. the full model) (Johnson 2014).
Table 3.2	Estimated coefficients, standard errors, and 95% confidence intervals
for the mixed model fitted to the standardized familiarity ratings.

                                    b     SE b       t    95% CI
Intercept                        0.00     0.05    0.00    -0.10, 0.09
LogFreqPP                        0.59     0.05   10.85     0.47, 0.69
LogFreqN                        -0.01     0.05   -0.10    -0.11, 0.10
RatingScale                     -0.00     0.02   -0.01    -0.04, 0.03
RatingScale x LogFreqPP          0.01     0.02    0.50    -0.03, 0.05
RatingScale x LogFreqN           0.04     0.02    1.68    -0.01, 0.08
PresentationOrder               -0.04     0.05   -0.80    -0.14, 0.05
PresentationOrder x RatingScale -0.03     0.02   -1.46    -0.06, 0.01

Note: Significant effects are printed in bold.
The factor RATINGSCALE did not have a significant effect, indicating that familiarity
ratings expressed on a Magnitude Estimation scale do not differ systematically
from familiarity ratings expressed on a Likert scale. Furthermore, the factor
RATINGSCALE did not enter into any interactions with other factors. This means that
the role of these factors does not differ depending on the scale used.
As can be observed from Table 3.2, just one factor proved to have a significant
effect: LOGFREQPP. Only the frequency of the phrase in the corpus significantly
predicted judgments, with higher frequency leading to higher familiarity ratings,
as can be observed from Figure 3.2. This phrase frequency effect was found both
in Likert and ME ratings, at Time 1 as well as at Time 2.
Figure 3.2	Scatterplot of the log-transformed corpus frequency per million
words of the PP and the standardized familiarity ratings, split up
according to whether the ratings were expressed on a 7-point Likert
scale or a Magnitude Estimation scale. Each circle/triangle
represents one observation; the lines represent linear regression
lines with a 95% confidence interval.
3.4.2 Variation across participants
Given that people differ in their linguistic experiences, familiarity with particular
word strings was expected to vary across participants, and the differences were
hypothesized to be larger in phrases with low corpus frequencies compared to
high-frequency phrases. The standard deviations listed in Appendix 3.2 quantify
per item the amount of variation in judgment across participants. Figure 3.3 plots
these standard deviations against the corpus frequencies of the phrases. Low-frequency
phrases tend to display more variation in judgment across participants
than high-frequency phrases, as evidenced by higher standard deviations. This
holds for Likert ratings more so than for ME ratings.
Figure 3.3	Scatterplots of the standard deviations in relation to the
log-transformed corpus frequency per million words of the PP. The lines
represent linear regression lines with a 95% confidence interval
around them.
3.4.3 Variation across time
To examine variation across time, we calculated the correlation between the
ratings assigned at Time 1 and those assigned at Time 2. When averaging over
participants, the ratings are highly stable, regardless of the scales that were used.
Per condition, we computed mean ratings for each of the 79 items at Time 1, and
likewise at Time 2. The correlation between these two sets of mean ratings is
nearly perfect in all four conditions (see Table 3.3).
Table 3.3	Correlation of mean standardized ratings at Time 1 and Time 2
(Pearson’s r).

Time 1    Time 2    Correlation mean ratings T1 – T2    95% CI
Likert    Likert    .97                                 .96, .98
Likert    ME        .96                                 .94, .97
ME        Likert    .98                                 .97, .98
ME        ME        .98                                 .97, .99
We also examined the stability of individual participants’ ratings. For each
participant we computed the correlation between that person’s judgments at
Time 1 and that person’s judgments at Time 2. This yielded 91 correlation scores
that range from -.31 to .90, with a mean correlation of .70 (SD = .20). The four
conditions do not differ significantly in terms of intra-individual stability (H(3) =
4.76, p = .19). If anything, the ME-ME condition yields slightly more stable
judgments than the other conditions, as can be observed from Table 3.4 and
Figure 3.4.
Table 3.4	Distribution of individual participants’ Time 1 – Time 2 correlation
(Pearson’s r) of standardized scores.

Time 1    Time 2    Average of individual participants’ correlation (SD)    Range
Likert    Likert    .67 (.27)                                               -.31 – .87
Likert    ME        .66 (.21)                                               -.01 – .86
ME        Likert    .72 (.14)                                                .38 – .87
ME        ME        .76 (.11)                                                .45 – .90
There are three participants whose ratings at Time 2 do not correlate at all with
their ratings on the same items, with the same instructions and under the same
circumstances a few weeks earlier (r < .20). Two of them were part of the
Likert-Likert group; one of them belonged to the Likert-ME group.17 The majority of the
participants had much higher scores, though, and this holds for all conditions. In
total, 7.7% of the participants (N = 7) had self-correlation scores ranging from .20
to .50; 34.1% (N = 31) had scores ranging from .51 to .75; 54.9% (N = 50) had
scores ranging from .76 to .90. Still, none of the participants is as stable in their
ratings as the aggregated ratings presented in Table 3.3.

17
Low self-correlation scores are not related to educational background. The three
participants with self-correlation scores below .20 had intermediate vocational
education, higher vocational education, and higher education, respectively. As
regards the group with self-correlation scores ranging from .20 to .49, one
participant had intermediate vocational education, and the others had a tertiary
education degree.
Figure 3.4 Boxplot of participants’ correlation of their own standardized ratings
(Pearson’s r, Time 1 – Time 2).
3.4.4 Variation across time vs. variation across participants
If participants’ ratings at Time 1 are more similar to their own ratings at Time 2
than to the other participants’ ratings at Time 2, this indicates that the variation
across participants is larger than variation across time. We compared each
participant’s self-correlation to the correlation between that person’s ratings at
T1 and the group mean at T2 (following Field 2013: 287). For 8 participants,
self-correlation was significantly higher than the correlation with the group mean;
for 19 participants, the correlation with the group mean was significantly higher
than self-correlation; for 64 participants, there was no significant difference
between the two measures. All experimental conditions showed a similar pattern in this
respect.
3.4.5 Variation across time in relation to corpus frequencies and rating scale
In order to determine if familiarity ratings were stable for certain items more so
than for others, or for one rating scale more so than for the other, we analyzed
the Δ-scores using linear mixed-models (see Sections 3.3.5 and 3.3.6). To be
precise, we investigated to what extent variation across time is related to
frequency of the phrase and the noun and to the rating scales used at Time 1 and
Time 2.18 The resulting model is summarized in Table 3.5.
Table 3.5	Estimated coefficients, standard errors, and 95% confidence intervals
for the mixed model fitted to the log-transformed absolute Δ-scores.

                              b     SE b       t    95% CI
Intercept                 -1.31     0.10  -12.63    -1.51, -1.10
LogFreqPP                 -0.26     0.06   -4.34    -0.37, -0.14
RatingScaleT1              0.04     0.12    0.33    -0.20, 0.28
RatingScaleT2              0.18     0.12    1.52    -0.06, 0.41
LogFreqPP x RatingScaleT1  0.17     0.07    2.53     0.04, 0.31
LogFreqPP x RatingScaleT2  0.09     0.07    1.43    -0.03, 0.22

Note: Significant effects are printed in bold.
The type of scale that was used did not have a significant effect on the variation
across time. Furthermore, the interaction term RATINGSCALET1 x RATINGSCALET2
did not contribute to explaining variance in Δ-scores (see Appendix 3.3). One may
have expected ratings to be more stable if the same type of scale was used across
sessions (i.e. Likert-Likert or ME-ME, rather than Likert-ME or ME-Likert). The fact
that the interaction RATINGSCALET1 x RATINGSCALET2 did not improve model fit
shows that this was not the case.
LOGFREQPP proved to have a significant effect, and there was a significant
interaction of LOGFREQPP with RATINGSCALET1. In general, higher phrase frequency
led to less variation in judgment across time. However, the relationship between
phrase frequency and instability in judgment was not observed in all experimental
conditions (see Figure 3.5). It holds for the ratings when at Time 1 Likert-scales
were used to express familiarity (i.e. the two plots on the left in Figure 3.5).
18 As was reported in Section 3.3.4, the phrases were presented in a fixed order,
the same for all participants. We tested whether there were effects of fatigue (e.g.
more instability towards the end of the experiment) by including the factor
PRESENTATIONORDER in the mixed-effects models. Neither PRESENTATIONORDER, nor
any of the interactions of PRESENTATIONORDER and the other predictors was found
to improve model fit (see Appendix 3.3).
Figure 3.5	Scatterplot of the log-transformed corpus frequency per million
words of the PP and the log-transformed absolute Δ-scores, per
experimental condition. Each circle represents one observation; the
lines represent linear regression lines with a 95% confidence
interval around them. Note: The lower the log-transformed Δ-score,
the more stable the judgments were. For instance, a Δ-score of 0.02
(meaning very little difference between the ratings at Time 1 and
Time 2) corresponds to a log-transformed Δ-score of -3.91.
3.5 Discussion
For a long time, variation has been overlooked, ignored, looked at from a limited
perspective (e.g. variation being simply the result of irrelevant performance
factors), or considered troublesome in various fields of linguistics. The variation
observable in metalinguistic performance made Birdsong (1989: 206-207)
wonder, rather despairingly: “Should we throw up our hands in frustration in the
face of individual, task-related, and situational differences, or should we blithely
sweep dirty data under the rug of abstraction?” Our answer to that question is:
neither of those. We argue that it is both feasible and valuable to study different
types of variation. Such investigations yield a more accurate presentation of the
data, and they contribute to the refinement of theories of linguistic knowledge. To
illustrate this, we had native speakers of Dutch rate the familiarity of a large set
of prepositional phrases twice within the space of one to two weeks, using either
Magnitude Estimation or a 7-point Likert scale. This dataset enabled us to
examine variation across items, variation across participants, variation across
time, and variation across rating methods. We have shown how these different
types of variation can be quantified, and we have used them to test hypotheses regarding
linguistic representations.
Our analyses indicate, first of all, that familiarity judgments form
methodologically reliable, useful data in linguistic research. The ratings we
obtained with one scale were corroborated by the ratings on the other scale (recall
that there was no main effect of the factor RATINGSCALE in the analysis of the
judgments, indicating that the ratings expressed on a Magnitude Estimation scale
did not differ systematically from the ratings expressed on a Likert scale). In
addition, there was a near perfect Time1–Time2 correlation of the mean ratings
in all experimental conditions, and the majority of the participants had high self-correlation scores. Furthermore, the data show a clear correlation between
familiarity ratings and corpus frequencies. As familiarity is taken to rest on usage
frequency, the ratings were hypothesized to display variation across items that
could be predicted largely from corpus frequencies (but not fully, since no corpus
can be a perfect representation of an individual participant’s linguistic
experiences, cf. Mandera et al. 2017). This prediction was borne out. Both in the
Likert and in the ME condition, at Time 1 as well as at Time 2, higher phrase
frequency led to higher familiarity ratings. These findings indicate that the
participants performed the task properly, and that the tasks measured what they
were intended to measure.
In addition to variation across items, we observed variation across participants
and variation across time in familiarity ratings. These types of variation are
indicative of the dynamic nature of linguistic representations. Put differently,
variation is part of speakers’ linguistic competence. Usage-based exemplar
models naturally accommodate such variation (e.g. Goldinger 1996; Hintzman
1986; Pierrehumbert 2001). In these models, linguistic representations consist of
a continually updating set of exemplars that include a large amount of detail
concerning linguistic and extra-linguistic properties. An exemplar is strengthened
when more and/or more recent tokens are categorized as belonging to it.
Representations are thus dynamic and detailed, naturally embedding the variation
that is experienced.
This variation can then be exploited by a speaker in the construction of social
and geographical identities (e.g. Sebregts 2015; Sharma 2011). It can also come
to the fore unintentionally, as in familiarity judgments that differ slightly across
rating sessions. While the judgment task requires people to indicate the position
of a given item on a scale of familiarity by means of a single value, its familiarity
for a particular speaker may best be viewed as a moving target located in a region
that may be narrower or wider. In that case, there is not just one true value, but a
range of scores that constitute true expressions of an item’s familiarity. Variation
in judgment across time is not noise then, but a reflection of the dynamic
character of cognitive representations as more, or less, densely populated clouds
of exemplars that vary in strength depending on frequency and recency of use.
While a single familiarity rating can be a true score, it does not offer a complete
picture.19

19 Smits et al. (2006) proposed with respect to speech sound representations that
they can be viewed as distributions. It would be interesting to investigate whether
this also applies to familiarity judgments. By means of an artificial language
paradigm, one would be able to control the distributional properties of the input.
If metalinguistic judgments are then collected in a repeated-measures design, one
can examine whether the judgments take the form of a distribution, and if so, to
what extent it corresponds to the distribution in the input.

This also implies that prudence is in order in the interpretation of a difference
in judgment between participants on the basis of a single measurement. Such a
difference cannot be taken as the difference in their metalinguistic
representations. Not because this difference should be seen as mere noise (as
Featherston 2007 contends), but because it portrays just part of the picture. It is
only when you take into account the range of each individual’s dynamic
representations that you arrive at a more accurate conclusion. Future research
should also look at mental representations of (partially) schematic constructions,
including syntactic patterns, using this method. In a usage-based approach, these
are assumed not to be essentially different from the lexical phrases we tested.
If you intend to measure variation across items, participants, and/or time, what
kind of instrument would be most suitable? Our investigation shows that in
several respects, Magnitude Estimation and a 7-point Likert scale yield similar
outcomes. The Magnitude Estimation ratings did not differ significantly from the
ratings expressed on the Likert scale, as evidenced by the absence of an effect of
the factor RATINGSCALE in the analysis of the familiarity judgments. Both types of
ratings showed a significant effect of phrase frequency. There were no significant
differences between the scales in terms of Time1–Time2 correlations.
Nevertheless, there are certain differences between Likert and ME ratings that
deserve attention and that ought to be taken into account when selecting a
particular scale.
One such difference is the possibility to determine whether participants
consider the majority of items to be familiar (or unfamiliar). If most items receive
a rating of 5 or more on a 7-point scale, this indicates that they are perceived as
fairly familiar. ME data only show to what extent particular stimuli are rated as
more familiar than others; they do not provide any information as to how familiar
that is in absolute terms.
Another difference concerns the possibility to determine whether participants
consider the entire set of stimuli more familiar the second time, as a result of the
exposure in the test sessions. The method of Magnitude Estimation entails that
the raw scores from different sessions cannot be compared directly, as a
participant may construct a new scale at each occasion. Consequently, a score
of 50 assigned by someone at Time 2 does not necessarily mean the same as a
score of 50 assigned by that participant at Time 1: at Time 2 that participant’s
scale could range from 50 upwards, while 50 may have represented a relatively
high score on that same person’s ME scale at Time 1. Magnitude Estimation
therefore requires raw scores to be converted into Z-scores for each session
separately. If all items are considered more familiar at Time 2, while the range of
the scores and the ranking of the items remain the same across sessions, the
Z-scores at Time 1 and Time 2 will be the same. When participants use the same
fixed Likert scale on both occasions, the researcher is better able to compare the
raw scores directly. Although there is no guarantee that a participant interprets
and uses the Likert scale in exactly the same way on both occasions, any changes
are arguably limited in scope. A Likert scale thus allows you to examine whether
all stimuli received a higher rating in the second session, provided that there is no
ceiling effect preventing increased familiarity from being expressed for certain items. If
such an analysis is of importance in your investigation, a Likert scale with a
sufficient number of response options may be more useful than Magnitude
Estimation. For the participants who were assigned to the Likert-Likert condition,
we conducted this additional analysis, calculating Δ-scores on the basis of the
raw Likert scores. This yielded 1896 Δ-scores. 48.7% of those equaled zero,
meaning that a participant assigned exactly the same Likert score to a particular
stimulus at Time 1 and Time 2. A further 30.6% consisted of a difference in rating
across time of maximally one point on a 7-point Likert scale; 10.5% involved a
difference of two points. The remaining 10.2% of the Δ-scores comprised a
difference of more than two points. In 31.5% of the cases, a stimulus was rated
(slightly) higher at Time 1 than at Time 2; in 19.8% of the cases, a stimulus was
rated (slightly) higher at Time 2 than at Time 1.
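The contrast between session-specific ME scales and a fixed Likert scale can be sketched in a few lines. The raw scores below are invented for illustration; the point is that per-session standardization neutralizes a wholesale shift of an ME scale, whereas raw Likert scores support direct Δ-score comparisons.

```python
import numpy as np

# Invented raw scores for one participant across two ME sessions.
# The Time 2 scale starts higher, but the relative judgments are identical.
me_t1 = np.array([10.0, 20.0, 40.0, 80.0])
me_t2 = np.array([50.0, 60.0, 80.0, 120.0])   # same spread, shifted upward

def zscore(x):
    # Standardize within one session, as ME scales are session-specific
    return (x - x.mean()) / x.std()

# After per-session standardization the two sessions coincide exactly
assert np.allclose(zscore(me_t1), zscore(me_t2))

# Likert raw scores, by contrast, can be compared directly via delta-scores
likert_t1 = np.array([3, 5, 6, 7])
likert_t2 = np.array([3, 6, 6, 7])
delta = likert_t2 - likert_t1
share_unchanged = np.mean(delta == 0)   # cf. the 48.7% reported above
print(delta, share_unchanged)
```

Note that an across-the-board increase in familiarity at Time 2 would be invisible in the ME Z-scores but would show up directly in the Likert Δ-scores.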
If a researcher decides to use a Likert scale, it would be advisable to carefully
consider the number of response options. When offered the opportunity to
distinguish more than seven degrees of familiarity, participants in our study did
so in the vast majority (83.3%) of the cases. The extent to which participants
would like a scale to be fine-grained may depend on the construct that is being
measured. If prior research offers little insight in this respect, researchers could
conduct a pilot study using scales that vary in number of response options.
One more difference we observed between the ME scale and the Likert scale
concerns the effect of phrase frequency on variation across participants and
variation across time. In Likert ratings, these types of variation were more
pronounced in low-frequency items than in high-frequency ones. This effect did
not occur in the Magnitude Estimation ratings. While there may be explanations
for the susceptibility of Likert ratings to variation among low-frequency stimuli,
this is not an intentional effect of the Likert scale as a measuring instrument, and
one should be aware that it might not be observed when a different type of scale
is used. To fully understand this difference between Magnitude Estimation and
Likert scales, more research is needed using participants whose experience with
particular stimuli is known to vary. In any case, Weskott and Fanselow’s (2011)
suggestion that Magnitude Estimation judgments are more liable to producing
variance than Likert ratings is contested by our data.
As we make a case for variation to be seen as a source of information, it
remains for us to answer the question: in which cases is variation really spurious?
We suggest that in untimed metalinguistic judgments variation is hardly ever
noise. A typo gone unnoticed (e.g. ‘03’ instead of ‘30’) could be considered noise;
if participants had another look, they would identify it as a mistake and correct it.
In the unfortunate case that participants get bored, they might assign random
scores to finish as quickly as possible. Crucially, in both cases, the ratings entered
are in effect no real judgments. All variation in actual judgments stems from
characteristics of language use and linguistic representations, and is therefore
theoretically interesting. This is not to say that there will be no unexplained
variance in the data. But instead of representing noise, this variance is information
waiting to be interpreted. There are factors that have not yet been identified as
relevant, as a result of which they are neither controlled for nor included in the
analyses, or that we have not yet been able to operationalize. To cite Birdsong
(1989: 69) once more: “Metalinguistic data are like 25-cent hot dogs: they contain
meat, but a lot of other ingredients, too. Some of these ingredients resist ready
identification. (…) linguistic theorists are becoming alert to the necessity of
knowing what these ingredients are.” Ignoring the variation present in the data
will most certainly not enhance our understanding of these ‘other ingredients’ and
the way they play a part in the representation and use of linguistic knowledge. Let
us explore the opportunities analyses of variance offer and realize their full
potential.
Chapter 4
Abstract
While theories on predictive processing posit that predictions are based on one’s
prior experiences, experimental work has effectively ignored the fact that people
differ from each other in their linguistic experiences and, consequently, in the
predictions they generate. We examine usage-based variation by means of three
groups of participants (recruiters, job-seekers, and people not (yet) looking for a
job), two stimuli sets (word sequences characteristic of either job ads or news
reports), and two experiments (a completion task and a Voice Onset Time task).
We show that differences in experiences with a particular register result in
different expectations regarding word sequences characteristic of that register,
thus pointing to differences in mental representations of language. Subsequently,
we investigate to what extent different operationalizations of word predictability
are accurate predictors of voice onset times. A measure of a participant’s own
expectations proves to be a significant predictor of processing speed over and
above word predictability measures based on amalgamated data. These findings
point to actual individual differences and highlight the merits of going beyond
amalgamated data. We thus demonstrate that it is feasible to empirically assess
the variation implied in usage-based theories, and we advocate exploiting this
opportunity.
This chapter is based on:
Verhagen, V., Mos, M., Backus, A. & Schilperoord, J. (2018). Predictive language
processing revealing usage-based variation. Language and Cognition, 10(2), 329–
373. https://doi.org/10.1017/langcog.2018.4
Acknowledgements
I thank Louis Onrust and Antal van den Bosch for their help in analyzing the corpus
data, Sanneke Vermeulen for her help in collecting the experimental data, and
Elaine Francis and two reviewers for their helpful comments on the paper we
submitted to Language and Cognition.
Predictive language processing revealing
usage-based variation
4.1 Introduction
Prediction-based processing is such a fundamental cognitive mechanism that it
has been stated that brains are essentially prediction machines (A. Clark 2013).
Language processing is one of the domains in which context-sensitive prediction
plays an important role. Predictions are generated through associative activation
of relevant mental representations. Prediction-based processing can thus yield
insight into mental representations of language. This understanding can be
deepened by paying attention to variation across speakers. As yet, most
investigations in this field of research suffer from a lack of attention to such
variation. We will show why this is an important limitation and how it can be
remedied.
A variety of studies indicate that people generate expectations about
upcoming linguistic elements and that this affects the effort it takes to process
the subsequent input (see Huettig 2015; Kuperberg & Jaeger 2016; Kutas, DeLong
& Smith 2011 for recent overviews). One of the types of knowledge that can be
used to generate expectations is knowledge about the patterns of co-occurrence
of words, which is mainly based on prior experiences with these words. To date,
word predictability has been expressed as surprisal based on co-occurrence
frequencies in corpus data, or as cloze probability based on completion task data.
Predictive language processing, then, is usually demonstrated by relating surprisal
or cloze probability to an experimental measure of processing effort, such as
reaction times. If a word’s predictability is determined by the given context and
stored probabilistic knowledge resulting from cumulative exposure, surprisal or
cloze probability can be used to predict ease of processing.
Crucially, in nearly all studies to date, the datasets providing word predictability
measures come from different people than the datasets indicating performance
in processing tasks, and that is a serious shortcoming. Predictability will vary
across language users, since people differ from each other in their linguistic
experiences. The corpora that are commonly used are at best a rough
approximation of the participants’ individual experiences. Whenever cloze
probabilities from a completion task are related to reaction time data, the
experiments are conducted with different groups of participants. The studies
conducted so far offer little insight into the degrees of individual variation and
task-dependent differences, and they adopt a coarse-grained approach to the
investigation of prediction-based processing.
The main goal of this paper is to reveal to what extent differences in experience
result in different expectations and responses to experimental stimuli, thus
pointing to differences in mental representations of language. This advances our
understanding of the theoretical status of individual variation and its
methodological implications. We use two domains of language use and three
groups of speakers that can reasonably be expected to differ in experience with
one of these domains. First, we examine the variation within and between groups
in the predictions participants generate in a completion task. Subsequently, we
investigate to what extent a participant’s own expectations affect processing
speed. If both the responses in a completion task and the time it takes to process
subsequent input are reflections of prediction-based processing, then an
individual’s performance on the processing task should correlate with his or her
performance on the completion task. Moreover, given individual variation in
experiences and expectations, a participant’s own responses in the completion
task may prove to be a better predictor than surprisal estimates based on data
from other people.
To investigate this, we conducted two experiments with the same participants
who belonged to one of three groups: recruiters, job-seekers, and people not (yet)
looking for a job. These groups can be expected to differ in experience with word
sequences that typically occur in the domain of job hunting (e.g. goede
contactuele eigenschappen ‘good communication skills’, werving en selectie
‘recruitment and selection’). The groups are not expected to differ systematically
in experience with word sequences that are characteristic of news reports (e.g.
de Tweede Kamer ‘the House of Representatives’, op een gegeven moment ‘at a
certain point’). For each of these two registers, we selected 35 word sequences
and used these as stimuli in two experiments that yield insight into participants’
linguistic representations and processing: a completion task and a Voice Onset
Time experiment.
In the following section, we discuss the concept of predictive processing in
more detail. We describe how prediction in language processing is commonly
investigated, focusing on the research design of those studies and the limitations.
We then report on the outcomes of our study into variation in predictions and
processing speed. The results show that there are meaningful differences to be
detected between groups of speakers, and that a small collection of data elicited
from the participants themselves can be more informative than general corpus
data. The prediction-based effects we observe are shown to be clearly influenced
by differences in experience. On the basis of these findings, we argue that it is
worthwhile to go beyond amalgamated data whenever prior experiences form a
predictor in models of language processing and representation.
4.1.1 Prediction-based processing in language
Context-sensitive prediction is taken to be a fundamental principle of human
information processing (Bar 2007; A. Clark 2013). As Bar (2007: 281) puts it, “the
brain is continually engaged in generating predictions”. These processes have
been observed in numerous domains, ranging from the formation of first
impressions when meeting a new person (Bar, Neta & Linz 2006), to the gustatory
cortices that become active not just when tasting actual food, but also while
looking at pictures of food (Simmons, Martin & Barsalou 2005), and the
somatosensory cortex that becomes activated in anticipation of tickling, similar
to the activation during the actual sensory stimulation (Carlsson, Petrovic, Skare,
Petersson & Ingvar 2000).
In order to generate predictions, the brain “constantly accesses information in
memory” (Bar 2007: 288), as predictions rely on associative activation. We extract
repeating patterns and statistical regularities from our environment and store
them in long-term memory as associations. Whenever we receive new input (from
the senses or driven by thought), we seek correspondence between the input and
existing representations in memory. We thus activate associated, contextually
relevant representations that translate into predictions. So, by generating a
prediction, specific regions in the brain that are responsible for processing the
type of information that is likely to be encountered are activated. The analogical
process can thus assist in the interpretation of subsequent input. Furthermore, it
can strengthen and augment the existing representations.
Expectation-based activation comes into play in a wide variety of domains that
involve visual and auditory processing (see Bar 2007; A. Clark 2013). Language
processing is no exception in this respect (see, for example, Kuperberg & Jaeger
2016). This is in line with the cognitive linguistic framework, which holds that the
capacity to acquire and process language is closely linked with fundamental
cognitive abilities. In the domain of language processing, prediction entails that
language comprehension is dynamic and actively generative. Kuperberg and
Jaeger (2016) list an impressive body of studies that provide evidence that
readers and listeners anticipate structure and/or semantic information prior to
encountering new bottom-up information. People can use multiple types of
information –ranging from syntactic, semantic, to phonological, orthographic, and
perceptual– within their representation of a given context to predictively
pre-activate information and facilitate the processing of new bottom-up inputs.
There are several factors that influence the degree and representational levels
to which we predictively pre-activate information (Brothers, Swaab & Traxler 2017;
Kuperberg & Jaeger 2016). The extent to which a context is constraining matters
(e.g. a context like “The day was breezy so the boy went outside to fly a…” will
pre-activate a specific word such as ‘kite’ to a higher degree than “It was an
ordinary day and the boy went outside and saw a…”). Contexts may also differ in
the types of representations they constrain for (e.g. they could evoke a specific
lexical item, or a semantic schema, like a restaurant script). In addition to that,
the comprehender’s goal and the instructions and task demands play a role.
Whether you quickly scan, read for pleasure, or carefully process a text, may affect
the extent to which you generate predictions. Also the speed at which bottom-up
information unfolds is of influence: the faster the rate at which the input is
presented, the less opportunity there is to pre-activate information.
The contextually relevant associations that are evoked seem to be
pre-activated in a graded manner, through probabilistic prediction. On this account,
the mental representations for expected units are activated more than those of
less expected items (Roland, Yun, Koenig & Mauner 2012). The expected
elements, then, are easier to recognize and process when they appear in
subsequent input. When the actual input does not match the expectations, it is
more surprising and processing requires more effort.
As Kuperberg and Jaeger (2016) observe, most empirical work has focused
on effects of lexical constraint on processing. These studies indicate that a word’s
probability in a given context affects processing as reflected in reading times
(Fernandez Monsalve, Frank & Vigliocco 2012; McDonald & Shillcock 2003;
Roland et al. 2012; Smith & Levy 2013), reaction times (Arnon & Snider 2010;
Traxler & Foss 2000), and N400 effects (Brothers, Swaab & Traxler 2015; DeLong,
Urbach & Kutas 2005; Frank, Otten, Galli & Vigliocco 2015; Van Berkum, Brown,
Zwitserlood, Kooijman & Hagoort 2005). A word’s probability is commonly
expressed as cloze probability or surprisal. The former is obtained by presenting
participants with a short text fragment and asking them to fill in the blank, naming
the most likely word (i.e. a completion task or cloze procedure, W. Taylor 1953).
The cloze probability of a particular word in the given context is expressed as the
percentage of individuals that complemented the cue with that word (DeLong et
al. 2005: 1117). A word’s surprisal is inversely related, through a logarithmic
function, to the conditional probability of a word given the sentence so far, as
estimated by language models trained on text corpora (Levy 2008). Surprisal thus
expresses the extent to which an incoming word deviates from what was
predicted.
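The two measures can be sketched in a few lines. The completion responses below are invented for the cue goede contactuele …; the words and proportions are purely illustrative, not data from the study.

```python
import math
from collections import Counter

# Invented completion-task responses for the cue "goede contactuele ...";
# the words and their proportions are illustrative only.
responses = ["eigenschappen", "eigenschappen", "eigenschappen",
             "vaardigheden", "eigenschappen", "vaardigheden"]
n = len(responses)

# Cloze probability: the share of respondents producing a given continuation
cloze = {word: count / n for word, count in Counter(responses).items()}

# Surprisal (in bits): -log2 of the word's conditional probability
surprisal = {word: -math.log2(p) for word, p in cloze.items()}

# The more expected continuation receives the lower surprisal
print(cloze["eigenschappen"], surprisal["eigenschappen"])
```

The logarithmic link means that surprisal rises steeply as a continuation becomes rare: halving a word's probability adds one bit of surprisal, whatever its starting level.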
4.1.2 Usage-based variation in prediction-based processing
The measures that quantify a word’s predictability in studies to date —cloze
probabilities and surprisal estimates— are coarse-grained approximations of
participants’ experiences. The rationale behind relating processing effort to these
scores is that they gauge people’s experiences and resulting predictions. The
responses in a completion task are taken to reflect people’s knowledge resulting
from prior experiences; the corpora that are used to calculate surprisal are
supposed to represent such experiences. However, the cloze probabilities and
surprisal estimates are based on amalgamations of data of various speakers, and
they are compared to processing data from yet other people. Given that people
differ from each other in their experiences, this matter should not be treated
light-heartedly. Language acquisition studies have convincingly shown children’s
language production to be closely linked to their own prior experiences (e.g.
Borensztajn, Zuidema & Bod 2009; Dąbrowska & Lieven 2005; Lieven, Salomo &
Tomasello 2009). In adults, individual variation in the representation and
processing of language has received much less attention.
If we assume that prediction-based processing is strongly informed by
people’s past experiences, the best way to model processing ease and speed
would require a database with all of someone’s linguistic experiences.
Unfortunately, linguists do not have such databases at their disposal. One way to
investigate the relationship between experiences, expectations, and ease of
processing is to use groups of speakers who are known to differ in experience
with a particular register, and to compare the variation between and within the
groups. This can then be contrasted with a register with which the groups’
experiences do not differ systematically. Having participants take part in both a
task that uncovers their predictions and a task that measures processing speed
makes it possible to relate reaction times to participants’ own expectations.
A comparison of groups of speakers to reveal usage-based variation appears
to be a fruitful approach. Various studies indicate that people with different
occupations (Dąbrowska 2008; Gardner, Rothkopf, Lapan & Lafferty 1987; Street
& Dąbrowska 2010, 2014), from different social groups (Balota, Cortese,
Sergent-Marshall, Spieler & Yap 2004; Caldwell-Harris, Berant & Edelman 2012), or with
different amounts of training (Wells, Christiansen, Race, Acheson & MacDonald
2009) vary in the way they process particular words, phrases, or (partially)
schematic constructions with which they can be expected to have different
amounts of experience. To give an example, Caldwell-Harris and colleagues
(2012) compared two groups with different prayer habits: Orthodox Jews and
secular Jews. They administered a perceptual identification task in which phrases
were briefly flashed on a computer screen, one word immediately after the other.
Participants were asked to report the words they saw, in the order in which they
saw them. As expected, the two groups did not differ from each other in
performance regarding the non-religious stimuli. On the religious phrases, by
contrast, Orthodox Jews were found to be more accurate and to show stronger
frequency effects than secular Jews. The participants who had greater experience
with specific phrases could more easily match the brief, degraded input to a
representation in long-term memory, recognize and report it. Note, however, that
these studies do not relate the performance on the experimental tasks to any
other data from the participants themselves, and, with the exception of Street and
Dąbrowska (2010, 2014), the researchers pay little attention to the degree of
variation within each of the groups of participants.
While we would expect individual differences in experience to affect
prediction-based processing, as those predictions are built on prior experience, very little
research to date has looked into this. To draw conclusions about the strength of
the relationship between predictions and processing effort, and the underlying
mental representations, we ought to pay attention to variation across language
users. This will, in turn, advance our understanding of the role of experience in
language processing and representation and the theoretical status of individual
variation.
4.2 Outline of the present research
In this paper, we examine variation between and within three groups of speakers,
and we relate the participants’ processing data to their own responses on a task
that reveals their context-sensitive predictions. Our first research question is: To
what extent do differences in amount of experience with a particular register
manifest themselves in different expectations about upcoming linguistic
elements when faced with word sequences that are characteristic of that
register? Our second research question is: To what extent do a participant’s own
responses in a completion task predict processing speed over and above word
predictability measures based on data from other people?
To investigate this, we had three groups of participants —recruiters,
job-seekers, and people not (yet) looking for a job— perform two tasks —a completion
task and a Voice Onset Time (VOT) task. In both tasks, we used two sets of
stimuli: word sequences that typically occur in the domain of job hunting and word
sequences that are characteristic of news reports. In the completion task, the
participants had to finish a given incomplete phrase (e.g. goede contactuele …
‘good communication …’), listing all things that came to mind. In the VOT task,
the participants were presented with the same cues, followed by a specific target
word (e.g. eigenschappen ‘skills’), which they had to read aloud as quickly as
possible. The voice onset times for this target word indicate how quickly it is
processed in the given context.
The cue is taken to activate knowledge about the words’ co-occurrence
patterns based on one’s prior experiences. Upon reading the cue, participants thus
generate predictions about upcoming linguistic elements. In the completion task,
the participants were asked to list these predictions. The purpose of the VOT task
is to measure the time it takes to process the target word, in order to examine the
extent to which processing is facilitated by the word’s predictability. According to
prediction-based processing models, the target will be easier to recognize and
process when it consists of a word that the participant expected than when it
consists of an unexpected word.
As the three groups differ in experience in the domain of job hunting,
participants’ experiences with these collocations resemble their fellow group
members’ experiences more than those of the other groups. Consequently, we
expect to see on the job ad stimuli that the variation across groups in expectations
is larger than the variation within groups. As the groups do not differ
systematically in experience with word sequences characteristic of news reports,
we expect variation across participants on these stimuli, but no systematic
differences between the groups.
Subsequently, we examine to what extent processing speed in the VOT task
correlates with participants’ expectations as expressed in the completion task.
The VOT task yields insight into the degree to which the recognition and
pronunciation of the final word of a collocation is influenced not only by the word’s
own characteristics (i.e. word length and word frequency), but also by the
preceding words and the expectations they evoke. By relating the voice onset
times to the participant’s responses on the completion task, we can investigate,
for each participant individually, how a word’s contextual expectedness affects
processing load. Various studies indicate that word predictability has an effect on
reading times, above and beyond the effect of word frequency, possibly even
prevailing over word frequency effects (Dambacher, Kliegl, Hofmann & Jacobs
2006; Fernandez Monsalve et al. 2012; Rayner, Ashby, Pollatsek & Reichle 2004;
Roland et al. 2012). In these studies, predictability was calculated on the basis of
data from people other than the actual participants. As we determine word
predictability for each participant individually, we expect our measure to be a
significant predictor of processing times, over and above measures based on data
from other people.
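The comparison we have in mind can be sketched as a nested-model check: does a participant-specific predictability measure reduce the residual error of a model that already contains a corpus-based surprisal estimate? All numbers below are invented, and the toy ordinary-least-squares fit stands in for the fuller regression analyses; it is not the actual model reported later.

```python
import numpy as np

# Toy illustration with invented per-trial data for one participant.
# own_cloze = share of this participant's own completion responses matching
# the target; surprisal = corpus-based estimate; vot_ms = voice onset time.
own_cloze = np.array([1.0, 0.5, 0.0, 1.0, 0.0, 0.5, 1.0, 0.0])
surprisal = np.array([2.1, 4.0, 7.5, 1.8, 6.9, 3.5, 2.4, 8.0])
vot_ms = np.array([420.0, 480.0, 560.0, 410.0, 545.0, 470.0, 430.0, 570.0])

def rss(X, y):
    # Residual sum of squares of an ordinary-least-squares fit
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

intercept = np.ones_like(vot_ms)
X_base = np.column_stack([intercept, surprisal])             # amalgamated data only
X_full = np.column_stack([intercept, surprisal, own_cloze])  # + own expectations

# If the participant's own expectations carry extra information, the fuller
# model should fit the voice onset times substantially better.
print(rss(X_base, vot_ms), rss(X_full, vot_ms))
```

Since adding a predictor can never increase the residual sum of squares, the interesting question is whether the reduction is large enough to be significant; in the analyses below this is assessed with model comparisons rather than raw RSS values.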
4.2.1 Participants
122 native speakers of Dutch took part in this study. All of them had completed
higher vocational or university education or were in the process of doing so. The
participants belong to one of three groups. The first group, labeled Recruiters,
consists of 40 people (23 female, 17 male) who were working as a recruiter,
intermediary, or HR adviser at the time of the experiment. Their ages range from
22 to 64, mean age 36.0 (SD = 10.0).
The second group, Job-seekers, consists of 40 people (23 female, 17 male)
who were selected on the basis of reporting to have read at least three to five job
advertisements per week in the three months prior to the experiment, and who
76
Chapter 4
never had a job in which they had to read and/or write such ads. Their ages range
from 19 to 50, mean age 33.8 (SD = 8.6).
The third group, labeled Inexperienced, consists of 42 students of
Communication and Information Sciences at Tilburg University (28 female, 14
male) who participated for course credit. They were selected on the basis of
reporting to have read either no job ads in the past three months, or a few but less
than one per week. Furthermore, in the past three years there was not a single
month in which they had read 25 job ads or more, and they never had a job in
which they had to read and/or write such ads. These participants’ ages range
from 18 to 26, mean age 20.2 (SD = 2.1).
4.2.2 Stimuli
The stimuli consist of 35 word sequences characteristic of job advertisements
and 35 word sequences characteristic of news reports. These word sequences
were identified by using a Job ad corpus and the Twente News Corpus, and
computing log-likelihood following the frequency profiling method of Rayson and
Garside (2000). The Job ad corpus was composed by Textkernel, a company
specialized in information extraction, web mining and semantic searching and
matching in the Human Resources sector. All the job ads retrieved in the year
2011 (slightly over 1.36 million) were compiled, yielding a corpus of 488.41 million
tokens. The Twente News Corpus (TwNC) is a corpus of comparable size (460.34
million tokens), comprising a number of national Dutch newspapers, teletext
subtitling and autocues of broadcast news shows, and news data downloaded
from the Internet (University of Twente, Human Media Interaction n.d.).20 By
means of the frequency profiling method we identified n-grams, ranging in length
from three to ten words, whose occurrence frequency is statistically higher in one
corpus than another, thus appearing to be characteristic of the former (see
Kilgarriff 2001). In order to filter out the enormous number of irrelevant sequences
such as Contract Soort Contract and _ _ _ _ _, which occur in the headers of the
job ads, we applied the criterion that a sequence had to occur at least ten times
in one corpus and two times in the other.
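The log-likelihood computation behind this frequency-profiling step can be illustrated with a minimal sketch. This is only an illustration of the Rayson and Garside (2000) statistic, not the actual extraction pipeline, and the counts in the example call are hypothetical:

```python
import math

def log_likelihood(freq1: int, freq2: int, size1: int, size2: int) -> float:
    """Log-likelihood statistic comparing an n-gram's frequency in two
    corpora (Rayson & Garside 2000). Higher values indicate a stronger
    association with one of the corpora."""
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)  # expected count in corpus 1
    expected2 = size2 * total / (size1 + size2)  # expected count in corpus 2
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

# Hypothetical n-gram: 500 hits in the 488.41M-token Job ad corpus,
# 10 hits in the 460.34M-token Twente News Corpus.
print(log_likelihood(500, 10, 488_410_000, 460_340_000))
```

An n-gram whose relative frequency is (nearly) identical in both corpora yields a statistic close to zero, so ranking n-grams by this value surfaces the sequences most characteristic of one corpus.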
20 The Twente News Corpus represents a fairly broad genre of text, to which the
three groups of participants can be presumed to have had similar exposure. The
fact that newspapers contain some job ads reflects that participants may have
had some exposure to texts of this type even if they are not actively looking for a
job or dealing with job ads professionally. The frequency with which they
encounter word sequences characteristic of job ads will be much lower, though,
than the frequency with which job-seekers and recruiters encounter them. The
word sequence “40 uur per week”, for example, occurs only 76 times in the entire
TwNC.
Prediction-based processing
77
We selected sequences that met a number of additional requirements. A string
had to end in a noun and it had to be comprehensible out of context. We only
included n-grams that constitute a phrase, with clear syntactic boundaries.
Sequences were also chosen in such a way that in the final set of stimuli all
content words occur only once.21 Furthermore, the selected sequences were to
cover a range of values on two types of corpus-based measures: sequence
frequency and surprisal of the final word in the sequence. With respect to the
former, we took into account the frequency with which the sequence occurs in
the specialized corpus (i.e. either the Job ad corpus or the News report corpus)
as well as a corpus containing generic data, meant to reflect Dutch readers’
overall experience, rather than one genre. We used a subset of the Dutch web
corpus NLCOW14 (Schäfer & Bildhauer 2012) as a generic corpus. The subset
consisted of a random sample of 8 million sentences from NLCOW14, comprising
in total 148 million words.
To obtain corpus-based surprisal estimates for the final word in the sequences,
language models were trained on the generic corpus. These models were then
used to determine the surprisal of the last word of the sequence (henceforth
target word). Surprisal was estimated using a 7-gram modified Kneser–Ney
algorithm as implemented in SRILM.22
The resulting set of stimuli and their frequency and surprisal estimates can be
found in Appendices 4.1 and 4.2. The length of the target words, measured in
number of letters, ranges from 3 to 17 (News report items M = 7.1, SD = 3.0, Job
ad items M = 8.6, SD = 3.6). Word length and frequency will be included as factors
in the analyses of the VOT data, as they are known to affect processing times.
4.2.3 Procedure
The study consisted of a battery of tasks, administered in one session.
Participants were tested individually in a quiet room. At the start of the session
they were informed that the purpose of the study was to gain insight into forms
of communication in job ads and news reports and that they would be asked to
read, complement, and judge short text fragments.
First, participants took part in the completion task in which they had to
complete the stimuli of which the final word had been omitted (see Section 4.3.1).
21 The only exception is the word goed ‘good’, which occurs twice.
22 SRILM is a toolkit for building and applying statistical language models (Stolcke
2002). Modified Kneser–Ney is a smoothing technique for language models that
not only ensures non-zero probabilities for unseen words or n-grams, but also
attempts to improve the accuracy of the model as a whole (Chen & Goodman
1999). A 7-gram model was used, since the length of the selected word strings
did not exceed seven words.
After that, they filled out a questionnaire regarding demographic variables (age,
gender, language background) and completed two short, attention-demanding
arithmetic distractor tasks created using the Qualtrics software program. These tasks
distracted participants from the word sequences that they had encountered in the
completion task and were about to see again in the Voice Onset Time experiment.
After that, the VOT experiment started. In this task, participants were shown an
incomplete stimulus (i.e. the last word was omitted), and then they saw the final
word. They read aloud this target word as quickly as possible (see Section 4.4.1
for more details).
The completion task and the VOT task were administered using E-Prime 2.0
(Psychology Software Tools Inc., Pittsburgh, PA), running on a Windows
computer. Participants were fitted with a head-mounted microphone to record their responses.
4.3 Experiment 1: Completion task
4.3.1 Method
4.3.1.1 Materials
The set of stimulus materials comprised 70 cues, divided over two ITEMTYPES: 35
Job ad cues (see Appendix 4.3) and 35 News report cues (see Appendix 4.4). A
cue consists of a test item in which the last word is replaced with three dots (e.g.
goede contactuele … ‘good communication …’). The stimuli were presented in a
random order that was the same for all participants, to ensure that any differences
between participants’ responses are not caused by differences in stimulus order.
4.3.1.2 Procedure
Participants were informed that they were about to see a series of short text
fragments. They were instructed to read them out loud and complete them by
naming all appropriate complements that immediately come to mind. For this,
they were given five seconds per trial. It was emphasized that there is not one
correct answer. In order to reduce the risk of chaining (i.e. responding with
associations based on a previous response rather than responding to the cue, see
McEvoy & Nelson 1982; De Deyne & Storms 2008), participants were shown three
examples in which the cue was repeated in every response (e.g. cue: een kopje …
‘a cup of …’, responses: een kopje koffie, een kopje thee, een kopje suiker ‘a cup
of coffee, a cup of tea, a cup of sugar’). In this way, we prompted participants to
repeat the cue every time, thus minimizing the risk of chaining.
Participants practiced with five cues that ranged in the degree to which they
typically select for a particular complement. They consisted of words unrelated to
the experimental items (e.g. een geruite … ‘a checkered …’). The experimenter
stayed in the testing room while the participant completed the practice trials, to
make sure the cue was read aloud. The experimenter then left the room for the
remainder of the task, which took approximately six minutes.
The first trial was initiated by a button press from the participant. The cues
then appeared successively, each cue being shown for 5000 ms in the center of
the screen. On each trial, the software recorded a .wav file with a five-second
duration, beginning simultaneously with the presentation of the cue.
4.3.1.3 Scoring of responses
All responses were transcribed. The number of responses per cue ranged from
zero to four, and varied across items and across participants. Table 4.1 shows the
mean number of responses on the two types of stimuli for each of the groups.
Mixed ANOVA shows that there is no effect of GROUP, F(2, 119) = 0.18, p = .83,
meaning that, when both item types are considered together, there are no significant
differences across groups in mean number of responses. There is a main effect
of ITEMTYPE on the average number of responses, F(1, 119) = 38.89, p < .001, and
an interaction effect between ITEMTYPE and GROUP, F(2, 119) = 16.27, p < .001.
Pairwise comparisons (using a Šidák adjustment for multiple comparisons)
revealed that there is no significant difference between the mean number of
responses on the two types of items for Recruiters (p = .951), while there is for
Job-seekers (p < .01) and for Inexperienced participants (p < .001). The fact that
the latter two groups listed more complements on news report items than they
did on job ad items is in line with the fact that these two groups have less
experience with Job ad phrases than with News report phrases. Note, however,
that a higher number of responses per cue does not necessarily imply a higher
degree of similarity to the complements that occur in the specialized corpora: a
participant may provide multiple complements that do not occur in the corpus.
Table 4.1
Mean number of responses participants gave per cue;
standard deviations between brackets.

                 News report cues    Job ad cues
                 M (SD)              M (SD)
Recruiters       1.12 (0.25)         1.12 (0.21)
Job-seekers      1.18 (0.31)         1.12 (0.24)
Inexperienced    1.24 (0.28)         1.06 (0.27)
By means of stereotypy points (see Fitzpatrick, Playfoot, Wray & Wright 2015) we
quantified how similar each participant’s responses are to the complements
observed in the specialized corpora. The nominal complements that occurred in
the corpus in question were assigned percentages that reflect the relative
frequency.23 The sequence 40 uur per (’40 hours per’), for example, was always
followed by the word week (‘week’) in the Job ad corpus. Therefore, the response
week was awarded 100 points; all other responses received zero points. In
contrast, the sequence kennis en (‘knowledge and’) took seventy-three different
nouns as continuations, a few of them occurring relatively often, and most
occurring just a couple of times. Each response thus received a corresponding
amount of points. For each stimulus, the points obtained by a participant were
summed, yielding a stereotypy score ranging from 0 to 100.24
4.3.1.4 Statistical analyses
By means of a mixed-effects logistic regression model (Jaeger 2008), we
investigated whether there are significant differences across groups of
participants and sets of stimuli in the proportion of responses that correspond to
a complement observed in the specialized corpora. Mixed models obviate the
necessity of prior averaging over participants and/or items, enabling the
researcher to model random subject and item effects (Jaeger 2008). Appendix
4.5 describes our implementation of this statistical technique.
23 For a given cue [Cue 1], we retrieved all complements in the corpus that consist
of a noun that immediately follows the string constituting the cue. This
constitutes [Set 1]. For each complement, we determined its token frequency in
[Set 1], ignoring any variation in the use of capitals. The sum of all complements’
token frequencies is [SumFreq]. A particular complement’s stereotypy points
were calculated as follows: [complement Cn’s token frequency in Set1] /
[SumFreq] * 100. If a response in the Completion task corresponded to
complement Cn, then that response was assigned Cn’s stereotypy points. If a
response in the Completion task did not correspond to any complement found in
the corpus, then that response was assigned zero stereotypy points.
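The scoring scheme described in this footnote can be sketched as follows. This is a simplified illustration with toy data, assuming that each distinct complement is credited once per cue (so scores cannot exceed 100):

```python
from collections import Counter

def stereotypy_points(corpus_complements, responses):
    """Assign each corpus complement a weight equal to its relative token
    frequency (as a percentage); a participant's score for a cue is the
    sum of the weights of the distinct complements they produced."""
    freqs = Counter(word.lower() for word in corpus_complements)
    total = sum(freqs.values())
    points = {word: 100 * f / total for word, f in freqs.items()}
    # Responses absent from the corpus receive zero points.
    return sum(points.get(r.lower(), 0.0) for r in set(responses))

# '40 uur per' is always followed by 'week' in the Job ad corpus:
print(stereotypy_points(["week"] * 76, ["week"]))   # 100.0
print(stereotypy_points(["week"] * 76, ["maand"]))  # 0.0
```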
24 Stereotypy points are related to, but not the same as, the metrics surprisal and
entropy. Entropy expresses the uncertainty at position t about what will follow;
surprisal expresses how unexpected the actually perceived word wt+1 is. As
Willems et al. (2016: 2507) explain: “if only a small set of words is likely to follow
the current context, many words will have (near) zero probability and entropy is
low”. The word that actually appears in this case may or may not be highly
surprising, depending on whether or not it conforms to the prediction. The
uncertainty about the upcoming word wt+1 does not appear to affect processing
of that word wt+1 when the effect of surprisal of wt+1 has been factored out. It is
word wt that is read more slowly when entropy(t) is higher (Frank 2013; Roark,
Bachrach, Cardenas & Pallier 2009).
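The distinction drawn here between surprisal and entropy can be made concrete with a small sketch; the probability distributions below are invented for illustration:

```python
import math

def surprisal(probability: float) -> float:
    """Surprisal (in bits) of the word that actually appears at t+1."""
    return -math.log2(probability)

def entropy(distribution: dict) -> float:
    """Entropy (in bits) at position t: uncertainty about what follows."""
    return -sum(p * math.log2(p) for p in distribution.values() if p > 0)

# A strongly selecting cue: low entropy, and low surprisal
# if the dominant complement indeed appears.
strong = {"week": 0.95, "maand": 0.05}
print(entropy(strong))            # ~0.29 bits
print(surprisal(strong["week"]))  # ~0.07 bits

# A cue with many equally plausible complements: high entropy.
open_cue = {"kunde": 0.25, "ervaring": 0.25, "inzicht": 0.25, "kennis": 0.25}
print(entropy(open_cue))          # 2.0 bits
```

As the footnote notes, a low-entropy context can still yield high surprisal when an unpredicted word appears: surprisal(strong["maand"]) is about 4.3 bits.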
4.3.2 Results
For each stimulus, participants obtained a stereotypy score that quantifies how
similar their responses are to the complements observed in the specialized
corpora. Table 4.2 presents the average scores of each of the groups on the two
types of stimuli.
Table 4.2
Mean stereotypy scores (on a 0-100 scale); standard
deviations between brackets

                 News report stimuli    Job ad stimuli
                 M (SD)                 M (SD)
Recruiters       31.1 (10.9)            42.0 (7.6)
Job-seekers      32.5 (5.5)             34.3 (9.5)
Inexperienced    29.5 (5.5)             18.5 (5.7)
The average scores in Table 4.2 mask variation across participants within each of
the groups (as indicated by the standard deviations) and variation across items
within each of the two sets of stimuli. Figure 4.1 visualizes for each participant
the mean stereotypy score on News report items and the mean stereotypy score
on Job ad items. It thus sketches the extent to which scores on the two item
types differ, as well as the extent to which participants within a group differ from
each other. Figure 4.2 portrays these differences in another manner; it visualizes
for each participant the difference in stereotypy scores on the two types of stimuli.
The majority of the Recruiters obtained a higher stereotypy score on Job ad
stimuli than on News report stimuli, as evidenced by the Recruiters’ marks above
the zero line. For the vast majority of the Inexperienced participants it is exactly
the other way around: their marks are predominantly located below zero. The Job-seekers show a more varied pattern, with some participants scoring higher on Job
ad items, some scoring higher on News report items, and some showing hardly
any difference between their scores on the two sets of items.
What the figures do not show is the degree of variation across items within
each of the two sets of stimuli. The majority of the Recruiters obtained a higher
mean stereotypy score on Job ad items than on News report items. Nevertheless,
there are several Job ad items on which nearly all Recruiters scored zero (see
Appendix 4.3; a group’s average stereotypy score of <10.0 indicates that most
group members received zero points on that item) and News report items on
which nearly all of them scored 100 (see Appendix 4.4, Recruiters’ average
scores >90.0).
Figure 4.1 Mean stereotypy score on the two types of stimuli for each individual
participant.
Figure 4.2 The difference between the mean stereotypy score on Job ad stimuli
and the mean stereotypy score on News report stimuli for each
individual participant; black bars show each group’s mean difference.
A circle below zero indicates that that participant obtained higher
stereotypy scores on News report stimuli than on Job ad stimuli.
By means of a mixed logit-model, we investigated whether there are significant
differences between groups and/or item types in the proportion of responses that
correspond to a complement observed in the specialized corpora while taking into
account variation across items and participants. The model (summarized in
Appendix 4.5) yielded four main findings.
First, we compared the groups’ performance on News report stimuli. The
model showed that there are no significant differences between groups in the
proportion of responses that correspond to a complement in the Twente News
Corpus. On the Job ad stimuli, by contrast, all groups differ significantly from each
other. The Recruiters have a significantly higher proportion of responses to the
Job ad stimuli that match a complement in the Job ad corpus than the
Job-seekers (β = -0.69, SE = 0.17, 99% CI: [-0.11, -0.26]). The Job-seekers, in turn,
have a significantly higher proportion of responses to the Job ad stimuli that
correspond to a complement in the Job ad corpus than the Inexperienced
participants (β = -1.69, SE = 0.25, 99% CI: [-2.34, -1.04]).
Subsequently, we examined whether participants’ performance on the Job ad
stimuli differed from their performance on the News report stimuli. The mixed
logit-model revealed that when variation across items and variation across
participants are taken into account, the difference in performance on the two
types of items does not prove to be significant for any group. However, there were
significant interactions. For the Recruiters, the proportion of responses that
correspond to a complement in the specialized corpus is slightly higher on the
Job ad items than on the News report items, while for the Job-seekers it is the
other way around. In this respect, these two groups differ significantly from each
other (β = 0.91, SE = 0.21, 99% CI: [0.36, 1.46]). For the Inexperienced participants,
the proportion of responses that correspond to a complement in the specialized
corpus is much higher on the News report items than on the Job ad items. As
such, the Inexperienced participants differ significantly from both the Job-seekers
(β = 1.23, SE = 0.32, 99% CI: [0.38, 2.07]) and the Recruiters (β = 2.14, SE = 0.38,
99% CI: [1.15, 3.09]).
4.3.3 Discussion
In this completion task, we investigated participants’ knowledge of various multi-word units that typically occur in either news reports or job ads. Participants
named the complements that came to mind when reading a cue, and we analyzed
to what extent their expectations correspond to the words’ co-occurrence
patterns in corpus data.
In all three groups, and in both stimulus sets, there is variation across
participants and across items in the extent to which responses correspond to
corpus data. Still, there is a clear pattern to be observed. On the News Report
items, the groups do not differ significantly from each other in the proportion of
responses that correspond to a complement observed in the Twente News
Corpus. On the Job ad stimuli, by contrast, all groups differ significantly. The
Recruiters’ responses correspond significantly more often to complements
observed in the Job ad corpus than the Job-seekers’ responses. The Job-seekers’
responses, in turn, correspond significantly more often to a complement in the
Job ad corpus than the responses of the Inexperienced participants.
The results indicate that there are differences in participants’ knowledge of
multi-word units which are related to their degree of experience with these word
sequences. This knowledge is the basis for prediction-based processing.
Participants’ expectations about upcoming linguistic elements, as expressed in
the completion task, should affect the effort it takes to process the
subsequent input. That is, the subsequent input will be easier to recognize and
process when it consists of a word that the participant expected than when it
consists of an unexpected word. We investigated whether the data on individual
participants’ expectations, gathered in the completion task, are a good predictor
of processing speed. In a follow-up Voice Onset Time experiment, we presented
the cues once again, together with a complement selected by us. Participants
were asked to read aloud this target word as quickly as possible. In some cases,
this target word had been mentioned by them in the completion task; in other
cases, it had not. Participants were expected to process the target word faster,
as evidenced by shorter voice onset times, if they had mentioned it themselves
than if they had not.
4.4 Experiment 2: Voice Onset Time task
4.4.1 Method
4.4.1.1 Materials
The set of stimuli comprised the same 70 experimental items as the completion
task (35 Job ad word sequences and 35 News report word sequences, described
in Section 4.2.2) plus 17 filler items. The fillers were of the same type as the
experimental items (i.e. (PREPOSITION) (ARTICLE) ADJECTIVE NOUN) and consisted of
words unrelated to these items (e.g. het prachtige uitzicht ‘the beautiful view’).
The stimuli were randomized once. The presentation order was the same for all
participants, to ensure that any differences between participants’ responses are
not caused by differences in stimulus order.
4.4.1.2 Procedure
Each trial began with a fixation mark presented in the center of the screen for a
duration ranging from 1200 to 3200 ms (the duration was varied to prevent
participants from getting into a fixed rhythm). Then the cue words appeared at
the center of the monitor for 1400 ms. A blank screen followed for 750 ms.
Subsequently, the target word was presented in blue font in the center of the
screen for 1500 ms. Participants were instructed to pronounce the blue word as
quickly and accurately as possible. At 1500 ms after the onset of the target word, a
fixation point appeared, marking the start of a new trial.
Participants practiced with eight items meant to range in the degree to which
the cue typically selects for a particular complement and in the surprisal of the
target word. The practice items consisted of words unrelated to the experimental
items (e.g. cue: een hart van ‘a heart of’, target: steen ‘stone’). The experimenter
remained in the testing room while the participant completed the practice trials,
to make sure the cue words were not read aloud, as the pronunciation might
overlap with the presentation of the target word. The experimenter then left the
room for the remainder of the task, which took approximately nine minutes.
The first trial was initiated by a button press from the participant. The stimuli
then appeared in succession. After 43 items there was a short break. The very
first trial and the one following the break were filler items. On each trial, the
software recorded a .wav file with a 1500 ms duration, beginning simultaneously
with the presentation of the target word.
All participants performed the task individually in a quiet room. The
Inexperienced group was made up of students who were tested in sound-attenuated booths at the university. The Recruiters and Job-seekers were tested
in rooms that were quiet, but not as free from distractions as the booths. This
appears to have influenced reaction times: the Inexperienced participants
responded considerably faster than the other groups (see Section 4.4.2). A by-subject random intercept in the mixed-effects models accounts for structural
differences across participants in reaction times.
4.4.1.3 Data preparation and statistical analyses
Mispronunciations were discarded (e.g. stuttering (re- revolutie), naming part of
the cue in addition to the target word (per week), or pronouncing loge (‘box’) as
logé (‘guest’) or lodge (‘lodge’)). This resulted in a loss of 0.59% of the Job ad data and
1.48% of the News report data. Speech onsets were determined by analyzing the
waveforms in Praat (Boersma & Weenink 2015; Kaiser 2013: 144).
Using linear mixed-effects models (Baayen et al. 2008), we examined whether
there are significant differences in VOTs across groups of participants and sets
of stimuli, analogous to the analyses of the completion task data. We then
investigated to what extent the voice onset times can be predicted by
characteristics of the individual items and participants. Our main interest is to
examine the relationship between VOTs and three different measures of word
predictability. In order to assess this relationship properly, we should take into
account possible effects of word length, word frequency, and presentation order,
since these factors may influence VOTs. Therefore, we included three sets of
factors. The first set concerns features of the target word, regardless of the cue,
that are known to affect naming times: the length of the target word and its lemma
frequency. The second set relates to artifacts of our experimental design:
presentation order and block. The third set consists of the factors of interest to
our research question: three different operationalizations of word predictability.
The predictor variables are discussed in turn below. The details of
the modeling procedure are described in Appendix 4.6.
WORDLENGTH
Longer words take longer to read (e.g. Balota et al. 2004; Kliegl,
Grabner, Rolfs & Engbert 2004). Performance on naming tasks
has been shown to correlate more strongly with number of letters than with
number of phonemes (Ferrand et al. 2011) or number of
syllables (Forster & Chambers 1973). Therefore, we included
length in letters of the target word as a predictor.
rLOGFREQ
Word frequency has been shown to affect reading and naming
times (Connine, Mullennix, Shernoff & Yelen 1990; Forster &
Chambers 1973; Kirsner 1994; McDonald & Shillcock 2003;
Roland et al. 2012). It is a proxy for a word’s familiarity and
probability of occurrence without regard to context. We
determined the frequency with which the target words (lemma
search) occur in the generic corpus. This corpus comprised a
wide range of texts, so as to reflect Dutch readers’ overall
experience, rather than one genre. The frequency counts were
log-transformed. Word length and word frequency were
correlated (r = -.46), as was to be expected. Frequent words
tend to have shorter linguistic forms (Zipf 1935). We
residualized word frequency against word length, thus removing
the collinearity from the model. The resulting predictor rLOGFREQ
can be used to trace the influence of word frequency on VOTs
once word length is taken into account.
PRESENTATIONORDER As was reported in the Materials section, the stimuli were
presented in a fixed order, the same for all participants. We
examined whether there were effects of presentation order (e.g.
shorter response times in the course of the experiment because
of familiarization with the procedure, or longer response times
because of fatigue or boredom), and whether any of the other
predictors entered into interaction with PRESENTATIONORDER.
BLOCK
The experiment consisted of two blocks of stimuli. Between the
blocks there was a short break. We checked whether there was
an effect of BLOCK.
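The residualization described under rLOGFREQ can be sketched as follows; the word lengths and log frequencies below are hypothetical, not the actual item values:

```python
import numpy as np

# Hypothetical items: target-word length in letters and log lemma frequency.
length = np.array([3, 5, 7, 9, 11, 13, 15, 17], dtype=float)
log_freq = np.array([5.1, 4.6, 4.0, 3.2, 2.9, 2.1, 1.8, 1.0])

# Regress log frequency on length; the residuals form rLOGFREQ,
# the component of frequency not predictable from word length.
slope, intercept = np.polyfit(length, log_freq, deg=1)
r_log_freq = log_freq - (intercept + slope * length)

# By construction the residualized predictor is uncorrelated with length.
print(np.corrcoef(length, r_log_freq)[0, 1])  # ~0
```

Entering r_log_freq instead of log_freq alongside word length removes the collinearity between the two predictors while preserving the frequency information.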
Various studies indicate that word predictability has an effect on reading and
naming times (McDonald & Shillcock 2003; Fernandez Monsalve et al. 2012;
Rayner et al. 2004; Roland et al. 2012; Traxler & Foss 2000). Word predictability
is commonly expressed by means of corpus-based surprisal estimates or cloze
probabilities, using amalgamated data from different people; hardly ever is it
determined for participants individually. In our analyses, we compare the following
three operationalizations:
GENERICSURPRISAL The surprisal of the target word given the cue, estimated by
language models trained on the generic corpus meant to reflect
Dutch readers’ overall experience (see Section 4.2.2 for more
details).25
CLOZEPROBABILITY The percentage of participants that complemented the cue in
the completion task preceding the VOT task with the target
word. We allowed for small variations, provided that the words
shared their morphological stem with the target (e.g. info –
informatie).
TARGETMENTIONED A binary variable that expresses for each participant individually
whether or not a target word was expected to occur. For each
stimulus, we assessed whether the target had been mentioned
by a participant in the completion task. Again, we allowed for
small variations, provided that the words shared their stem with
the target.
25 Language models could also be trained on the specialized corpora, instead of
the generic corpus. The use of SPECIALIZEDSURPRISAL instead of GENERICSURPRISAL
would not yield different outcomes, though; there is no effect of
SPECIALIZEDSURPRISAL on VOTs (β = 0.006, SE = 0.005, 99% CI: [-0.006, 0.018]).
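The two completion-task-based measures, CLOZEPROBABILITY and TARGETMENTIONED, could be derived from the transcribed responses along the following lines. This is a simplified sketch with invented data, in which the stem-matching step is reduced to exact string matching:

```python
def cloze_probability(target: str, responses_by_participant: dict) -> float:
    """CLOZEPROBABILITY: percentage of participants whose completion-task
    responses for this cue included the target word."""
    n = len(responses_by_participant)
    hits = sum(target in responses for responses in responses_by_participant.values())
    return 100 * hits / n

def target_mentioned(target: str, responses: list) -> bool:
    """TARGETMENTIONED: did this individual participant list the target?"""
    return target in responses

# Invented completion-task data for one cue:
data = {"p1": ["week"], "p2": ["week", "maand"], "p3": ["dag"]}
print(cloze_probability("week", data))       # ~66.7
print(target_mentioned("week", data["p2"]))  # True
```

CLOZEPROBABILITY amalgamates data over participants, whereas TARGETMENTIONED is determined for each participant individually, which is what allows each participant’s own expectations to be related to their voice onset times.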
To give an idea of the number of times the target words were listed in the
completion task, Table 4.3 presents the mean percentage of target words
mentioned by the participants in each of the groups.
Table 4.3
Mean percentage of target words that had been mentioned
by the participants in the completion task; range between brackets.

                 News report stimuli    Job ad stimuli
                 M (range)              M (range)
Recruiters       31.4 (20.0 – 51.4)     44.0 (20.0 – 60.0)
Job-seekers      31.6 (22.9 – 45.7)     36.6 (14.3 – 62.9)
Inexperienced    28.1 (17.1 – 40.0)     19.3 (2.9 – 40.0)
Finally, we included interactions between rLOGFREQ and measures of word
predictability, as the frequency effect may be weakened, or even absent, when the
target is more predictable (Roland et al. 2012).
4.4.2 Results
Table 4.4 presents for each group the mean voice onset time per item type. The
Inexperienced participants were generally faster than the other groups, on both
types of stimuli. This is likely due to factors irrelevant to our research questions:
differences in experimental setting, in experience with participating in
experiments, and in age. By-subject random intercepts account for such
differences.26 Of interest to us is the way the VOTs on the two types of items relate
to each other, and the extent to which the VOTs can be predicted by measures of
word predictability. These topics are discussed in turn.
26 Instead of using the mean VOT of all participants, each participant is assigned
a personal intercept value. General differences in reaction times are thus
accounted for. A participant who was relatively slow across the board will have a
higher intercept value than participants who were relatively fast. Apart from that,
the participants can resemble or differ from each other in the extent to which their
VOTs show effects of the predictor variables. An alternative method of accounting
for structural differences across participants in reaction times is to standardize
the VOTs. This rules out a by-subject random intercept, since every subject has a
mean standardized VOT of zero. The outcomes of a model fitted to standardized
VOTs were found not to differ essentially from the outcomes of the model fitted
to raw VOTs. Therefore, we only report the latter.
Table 4.4
Mean Voice Onset Times in seconds; standard deviations between
brackets.

                 News report stimuli    Job ad stimuli
                 M (SD)                 M (SD)
Recruiters       0.541 (0.14)           0.522 (0.14)
Job-seekers      0.539 (0.15)           0.531 (0.14)
Inexperienced    0.476 (0.12)           0.486 (0.11)
Table 4.4 shows that, on average, the Inexperienced participants responded faster
to the News report stimuli than to the Job ad stimuli, while for the other groups it
is just the other way around. Figures 4.3 and 4.4 visualize the relationship between
the VOTs on the two types of items for each participant individually. For 80% of the
Recruiters, the difference in mean VOTs on the two types of stimuli is negative,
meaning that they were slightly faster to respond to Job ad stimuli than to News
report stimuli. For 62.5% of the Job-seekers and 23.8% of the Inexperienced
participants the difference score is below zero. Mixed-effects models fitted to the
voice onset times (summarized in Table 4.4 and Figures 4.3 and 4.4) revealed
that the Inexperienced participants’ data pattern is significantly different from the
Recruiters’ (β = -0.030, SE = 0.007, 99% CI: [-0.048, -0.011]) and the Job-seekers’
(β = -0.019, SE = 0.005, 99% CI: [-0.034, -0.004]). That is, the fact that the
Inexperienced participants tended to be faster on the News report items than on
the Job ad items makes them differ significantly from both the Recruiters and the
Job-seekers (see Appendix 4.6 for more details).
Chapter 4
Figure 4.3 Mean Voice Onset Time on the two types of stimuli for each individual
participant.
Figure 4.4 The difference between the mean VOT on Job ad stimuli and the
mean VOT on News report stimuli for each individual participant;
black bars show each group’s mean difference. A circle below zero
indicates that that participant responded faster on Job ad stimuli than
on News report stimuli.
What Figures 4.3 and 4.4 do not show is the degree of variation in VOTs across
items within each of the two sets of stimuli. Every mark in Figure 4.3 averages
over 35 items that differ from each other in word length, word frequency, and word
predictability. By means of mixed-effects models, we examined to what extent
these variables predict voice onset times, and whether there are effects of
presentation order and block. We incrementally added predictors and assessed
by means of likelihood ratio tests whether or not they significantly contributed to
explaining variance in voice onset times. A detailed description of this model
selection procedure can be found in Appendix 4.6. The main outcomes are that
the experimental design variable BLOCK and the interaction term
PRESENTATIONORDER x BLOCK did not contribute to the fit of the model. The
stimulus-related variables WORDLENGTH and rLOGFREQ did contribute. As for the word
predictability measures, GENERICSURPRISAL did not improve model fit, but
CLOZEPROBABILITY and TARGETMENTIONED did. While the interaction between
rLOGFREQ and CLOZEPROBABILITY did not contribute to the fit of the model, the
interaction between rLOGFREQ and TARGETMENTIONED did. None of the interactions
of PRESENTATIONORDER and the other variables was found to improve goodness-of-fit.
The resulting model is summarized in Table 4.5. The variance explained by this
model is 60% (R²m = .15, R²c = .60).27
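The incremental comparison of nested models by likelihood ratio tests can be sketched as follows (a schematic illustration with made-up log-likelihood values, not the models reported here; the critical values are the usual chi-square cut-offs at α = .05):

```python
# Chi-square critical values at alpha = .05 for 1 and 2 degrees of freedom.
CHI2_CRIT = {1: 3.841, 2: 5.991}

def likelihood_ratio_test(ll_reduced, ll_full, df=1):
    """Compare two nested models: twice the difference in log-likelihood
    is approximately chi-square distributed, with df equal to the number
    of added parameters. Returns the test statistic and whether the
    added predictor(s) significantly improve model fit."""
    stat = 2 * (ll_full - ll_reduced)
    return stat, stat > CHI2_CRIT[df]

# Hypothetical values: adding a predictor raises the log-likelihood
# from -100 to -95, so the statistic is 10.0 and the predictor is kept.
stat, keep = likelihood_ratio_test(-100.0, -95.0, df=1)
```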
Table 4.5 presents the outcomes when Target not mentioned is used as the
reference condition. The intercept here represents the mean voice onset time
when the target had not been mentioned by participants and all of the other
predictors take their average value. A predictor’s estimated coefficient indicates
the change in voice onset times associated with every unit increase in that
predictor. The estimated coefficient of rLOGFREQ, for instance, indicates that, when
the target had not been mentioned and all other predictors take their average
value, for every unit increase in residualized log frequency, voice onset times are
12 milliseconds faster.
The model shows that CLOZEPROBABILITY significantly predicted voice onset
times: target words with higher cloze probabilities were named faster. In addition
to that, there is an effect of TARGETMENTIONED. When participants had mentioned
the target word themselves in the completion task, they responded significantly
faster than when they had not mentioned the target word (i.e. -0.055).
Lemma frequency (rLOGFREQ) proved to have an effect when the targets had
not been mentioned. When participants had not mentioned the target words in
the completion task, higher-frequency words elicited faster responses than
lower-frequency words. When the targets had been mentioned, by contrast, word
frequency had no effect on VOTs (β = -0.001; SE = 0.005; t = -0.13; 99% CI:
[-0.014, 0.012]).
Finally, the model shows that while longer words took a bit longer to read, the
influence of word length was not pronounced enough to be significant.
Presentation order did not have an effect either, indicating that there are no
systematic effects of habituation or boredom on response times.
27 R²m (marginal R² coefficient) represents the amount of variance explained by
the fixed effects; R²c (conditional R² coefficient) is interpreted as variance
explained by both fixed and random effects (i.e. the full model) (Johnson 2014).
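For a Gaussian model, these two coefficients are commonly computed as follows (the Nakagawa–Schielzeth formulation that Johnson 2014 extends; a sketch of the standard definition, not a quotation from this dissertation):

```latex
R^2_m = \frac{\sigma^2_f}{\sigma^2_f + \sum_{l}\sigma^2_l + \sigma^2_\varepsilon},
\qquad
R^2_c = \frac{\sigma^2_f + \sum_{l}\sigma^2_l}{\sigma^2_f + \sum_{l}\sigma^2_l + \sigma^2_\varepsilon}
```

where σ²f is the variance of the fixed-effects predictions, the σ²l are the random-effect variance components (here, e.g., by-subject and by-item intercepts), and σ²ε is the residual variance.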
Table 4.5	Generalized linear mixed-effects model (family: Gaussian) fitted to the
	voice onset times, using Target not mentioned as the reference condition.

	Estimate	SE	t	99% CI
(Intercept)	0.532	0.009	59.86	0.509, 0.556
WordLength	0.012	0.005	2.26	-0.002, 0.027
rLogFreq	-0.012	0.005	-2.58	-0.024, -0.001	**
PresentationOrder	0.007	0.005	1.31	-0.007, 0.020
ClozeProbability	-0.025	0.005	-4.64	-0.039, -0.011	**
TargetMentioned=yes	-0.055	0.004	-15.11	-0.065, -0.046	**
rLogFreq x TargetMentioned=yes	0.011	0.003	4.60	0.005, 0.018	**
Note: Significance code: 0.01 ‘**’
The effects of word frequency (rLOGFREQ) and TARGETMENTIONED, and the
interaction, are visualized in Figure 4.5. All along the frequency range, VOTs were
significantly faster when the target had been mentioned by the participants in the
preceding completion task. The effect of TARGETMENTIONED is more pronounced
for lower-frequency items (the distance between the red and the blue line being
larger on the left side than on the right side).
When the targets had not been mentioned, lemma frequency has an effect on
VOTs, with more frequent words being responded to faster, as indicated by the
descending red line. The effect of frequency is significantly different when the
target had been mentioned by participants. In those cases, frequency had no
impact.
Figure 4.5 Scatterplot of the log-transformed corpus
frequency of the target word (lemma), residualized against
word length, and the Voice Onset Times, split up according
to whether or not the target word had been mentioned by
a participant in the preceding completion task. Each circle
represents one observation; the lines are linear regression
lines with 95% confidence intervals.
4.4.3 Discussion
By means of the Voice Onset Time task, we measured the speed with which
participants processed a target word following a given cue. Our analyses revealed
that the Inexperienced participants’ data pattern was significantly different from
the Recruiters’ and the Job-seekers’: the majority of the Recruiters and the
Job-seekers responded faster to the Job ad items than to the News report items, while
it was exactly the other way around for the vast majority of the Inexperienced
participants.
In all three groups, and in both stimulus sets, there was variation across
participants and across items in voice onset times. We examined to what extent
this variance could be explained by different measures of word predictability, while
accounting for characteristics of the target words (i.e. word length and word
frequency) and the experimental design (i.e. presentation order and block). This
resulted in five main findings.
First of all, GENERICSURPRISAL, which is the surprisal of the target word given the
cue estimated by language models trained on the generic corpus, did not
contribute to the fit of the model. In other words, the mental lexicons of our
participants could not be adequately assessed by the generic corpus data. It is
quite possible that the use of another type of corpus —one that is more
representative of the participants’ experiences with the word sequences at hand—
could result in surprisal estimates that do prove to be a significant predictor of
voice onset times. It was not our goal to assess the representativeness of
different types of corpora. Studies by Fernandez Monsalve et al. (2012), Frank
(2013), and Willems, Frank, Nijhof, Hagoort, and Van den Bosch (2016) offer
insight into the ways in which corpus size and composition affect the accuracy
of the language models and, consequently, the explanatory power of the surprisal
estimates. Still, there may be substantial and systematic differences between
corpus-based word probabilities and cloze probabilities, as Smith and Levy (2011)
report, and cloze probabilities may be a better predictor of processing effort.
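The surprisal measure used here follows its standard information-theoretic definition: the negative log probability of the target word given the preceding context, so that less expected words carry higher surprisal. As a minimal sketch (base-2, i.e. in bits):

```python
import math

def surprisal(p_word_given_context):
    """Surprisal in bits: -log2 of the conditional probability that a
    language model assigns to the target word given the cue. Lower
    probability (less expected) means higher surprisal."""
    return -math.log2(p_word_given_context)

# A target with conditional probability .25 given the cue has a surprisal
# of 2 bits; a fully predictable target (probability 1) has surprisal 0.
```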
The second finding is that CLOZEPROBABILITY —a measure of word predictability
based on the completion task data of all 122 participants together— significantly
predicted voice onset times. Target words with higher cloze probabilities were
named faster. Combined, the first and the second finding indicate that general
corpus data is too coarse an information source for individual entrenchment, and
that the total set of responses in a completion task from the participants
themselves forms a better source of information.
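Cloze probability was thus operationalized over the pooled completion responses. Schematically (with hypothetical Dutch completions, not actual stimuli), a target's cloze probability is simply the proportion of participants who produced it for a given cue:

```python
from collections import Counter

def cloze_probabilities(responses):
    """Cloze probability of each completion: the proportion of
    participants who produced that word for a given cue.
    `responses` lists all participants' completions for one stimulus."""
    counts = Counter(responses)
    total = len(responses)
    return {word: n / total for word, n in counts.items()}

# Hypothetical completions from four participants for one cue:
probs = cloze_probabilities(["ervaring", "ervaring", "kennis", "ervaring"])
```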
Third, our variable TARGETMENTIONED had an effect on voice onset times over
and above the effect of CLOZEPROBABILITY. TARGETMENTIONED is a measure of the
predictability of a target for a given participant: if a participant had mentioned this
word in the completion task, this person was known to expect it through context-
sensitive prediction. Participants were significantly faster to name the target if
they had mentioned it themselves in the completion task. This operationalization
of predictability differs from those in other studies in that it was determined for
each participant individually, instead of being based on amalgamated data from
other people. It also differs from priming effects (McNamara 2005; Pickering &
Ferreira 2008), which tend to be viewed as non-targeted and rapidly decaying. In
our study, participants mentioned various complements in the completion task.
Five to fifteen minutes later (depending on a stimulus’ order of presentation in
each of the two tasks), the target words were presented in the VOT task. These
targets were identical, related, or unrelated to the complements named by a
participant. The effects of completion task responses on target word processing
in a reaction time task are usually not viewed as priming effects, given the
relatively long time frame and the conscious and strategic nature of the activation
of the words given as a response (see the discussion in Kuperberg & Jaeger 2016:
40; also see Otten & Van Berkum’s 2008 distinction between discourse-dependent
lexical anticipation and priming).
Both CLOZEPROBABILITY and TARGETMENTIONED are operationalizations of word
predictability. They were found to have complementary explanatory power.
CLOZEPROBABILITY proved to have an effect when the target had not been
mentioned by a participant, as well as when the target had been mentioned. In
both cases, higher cloze probabilities yielded faster VOTs. This taps into the fact
that there are differences in the degree to which the targets presented in the VOT
task are expected to occur. A higher degree of expectancy will contribute to faster
naming times. The binary variable TARGETMENTIONED does not account for such
gradient differences. CLOZEPROBABILITY, on the other hand, may be a proxy for this;
it is likely that targets with higher cloze probabilities are words that are considered
more probable than targets with lower cloze probabilities.
Conversely, TARGETMENTIONED explains variance that CLOZEPROBABILITY does not
account for. That is, participants were significantly faster to name the target if
they had come up with this word to complete the phrase themselves
approximately ten minutes earlier in the completion task. This finding points to
actual individual differences and highlights the merits of going beyond
amalgamated data. The fact that a measure of a participant’s own predictions is
a significant predictor of processing speed over and above word predictability
measures based on amalgamated data had not yet been shown in lexical
predictive processing research. It does fit in, more generally, with recent studies
into the processing of schematic constructions in which individuals’ scores from
one experiment were found to correlate with their performance on another task
(e.g. Misyak, Christiansen & Tomblin 2010; Misyak & Christiansen 2012).
The fourth main finding is that the effect of TARGETMENTIONED on voice onset
times was stronger for lower-frequency than for higher-frequency items (the
distance between the red and the blue line in Figure 4.5 being larger on the left
side than on the right side). The high-frequency target words may be so familiar
to the participants that they can process them quickly, regardless of whether or
not they had pre-activated them. The processing of low-frequency items, on the
other hand, clearly benefits from predictive pre-activation.
Fifth, corpus-based word frequency had no effect on VOTs when the target
had been mentioned in the completion task (i.e. t=-0.13 for rLOGFREQ; the blue
‘Target mentioned’ line in Figure 4.5 is virtually flat). In other words, predictive
pre-activation facilitates processing to such an extent that word frequency no
longer affects naming latency. When participants had not mentioned the target
words in the completion task, higher-frequency words elicited faster responses
than lower-frequency words (in Table 4.5 rLOGFREQ is significant (t=-2.58); the red
‘Target not mentioned’ line in Figure 4.5 descends).
4.5 General discussion
Our findings lead to three conclusions. First, there is usage-based variation in the
predictions people generate: differences in experiences with a particular register
result in different expectations regarding word sequences characteristic of that
register, thus pointing to differences in mental representations of language.
Second, it is advisable to derive predictability estimates from data obtained from
language users closely related to the people participating in the reaction time
experiment (i.e. using data from either the participants themselves, or a
representative sample of the population in question). Such estimates form a more
accurate predictor of processing times than predictability measures based on
generic data. Third, we have shown that it is worthwhile to zoom in at the level of
individual participants, as an individual’s responses in a completion task form a
significant predictor of processing times over and above group-based cloze
probabilities.
These findings point to a continuity with respect to observations in language
acquisition research: the significance of individual differences and the merits of
going beyond amalgamated data that have been shown in child language
processing, are also observed in adults. Furthermore, our findings are fully in line
with theories on context-sensitive prediction in language processing, which hold
that predictions are based on one’s own prior experiences. Yet in practice, work
on predictive processing has paid little attention to variation across speakers in
experiences and expectations. Studies investigating the relationship between
word predictability and processing speed have always operationalized
predictability by means of corpus data or experimental data from people other
than those taking part in the reaction time experiments. We empirically
demonstrated that such predictability estimates cannot be truly representative for
those participants, since people differ from each other in their linguistic
experiences and, consequently, in the predictions they generate. While usage-based
principles of variation are endorsed more and more (e.g. Barlow & Kemmer
2000; Bybee 2010; Croft 2000; Goldberg 2006; Kristiansen & Dirven 2008; Schmid
2015; Tomasello 2003), often the methodological implications of a usage-based
approach are not fully put into practice. In this paper, we show that there is
meaningful variation to be detected in prediction and processing, and we
demonstrate that it is both feasible and worthwhile to attend to such variation.
We examined variation in experience, predictions, and processing speed by
making use of two sets of stimuli, three groups of speakers, and two experimental
tasks. Our stimuli consisted of word sequences that typically occur in the domain
of job hunting, and word sequences that are characteristic of news reports. The
three groups of speakers –viz. recruiters, job-seekers, and people not (yet) looking
for a job– differed in experience in the domain of job hunting, while they did not
differ systematically in experience with the news report register. All participants
took part in two tasks that tap into prediction-based processing. The completion
task yielded insight into what participants expect to occur given a particular
sequence of words and their previous experiences with such elements. In the
Voice Onset Time task we measured the speed with which a specific complement
was processed, and we examined the extent to which this is influenced by its
predictability for a given participant.
The data from the completion task confirmed our hypotheses regarding the
variation within and across groups in the predictions participants generate. On the
News Report items, the groups did not differ significantly from each other in how
likely participants were to name responses that correspond to the complements
observed in the Twente News Corpus. On the Job ad stimuli, by contrast, all
groups differed significantly from each other. The Recruiters’ responses
corresponded significantly more often to complements observed in the Job ad
corpus than the Job-seekers’ responses. The Job-seekers’ responses, in turn,
corresponded significantly more often to a complement in the Job ad corpus than
the responses of the Inexperienced participants. The responses thus reveal
differences in participants’ knowledge of multi-word units which are related to
their degree of experience with these word sequences.
We then investigated to what extent a participant’s own expectations influence
the speed with which a specific complement is processed. If the responses in the
completion task are an accurate reflection of participants’ expectations, and if
prediction-based processing models are correct in stating that expectations affect
the effort it takes to process subsequent input, then it should take participants
less time to process words they had mentioned themselves than words they had
not listed. Indeed, whether or not participants had mentioned the target
significantly affected voice onset times. What is more, this predictive
pre-activation, as captured by the variable TARGETMENTIONED, was found to facilitate
processing to such an extent that word frequency could not exert any additional
accelerating influence. When participants had mentioned the target word in the
completion task, there was no effect of word frequency. This demonstrates the
impact of context-sensitive prediction on subsequent processing.
The facilitating effect of expectation-based preparatory activation was
strongest for lower-frequency items. This has been observed before, not just with
respect to the processing of lexical items (Dambacher et al. 2006; Rayner et al.
2004), but also for other types of constructions (e.g. Wells et al. 2009). It shows
that we cannot make general claims about the strength of the effect of
predictability on processing speed, as it is modulated by frequency.
Perhaps even more interesting is that the variable TARGETMENTIONED had an
effect on voice onset times over and above the effect of CLOZEPROBABILITY.
Participants were significantly faster to name the target if they had mentioned it
themselves in the completion task. This shows the importance of going beyond
amalgamated data. While this may not come across as surprising, it is seldom
shown or exploited in research on prediction-based processing. Even with a
simple binary measure like TARGETMENTIONED, we see that data elicited from an
individual participant constitute a powerful predictor for that person’s reaction
times. If one were to develop it into a measure that captures gradient differences
in word predictability for each participant individually, it might be even more
powerful.
Our study has focused on processing of multi-word units. Few linguists will
deny there is individual variation in vocabulary inventories. In a usage-based
approach to language learning and processing, there is no reason to assume that
individual differences are restricted to concrete chunks such as words and
phrases. One interesting next step, then, is to investigate to what extent similar
differences can be observed for partially schematic or abstract patterns. Some of
these constructions (e.g. highly frequent patterns such as transitives) might be
expected to show smaller differences, as exposure differs less substantially from
person to person. However, recent studies point to individual differences in
representations and processing of constructions that were commonly assumed
to be shared by all adult native speakers of English (see Kemp, Mitchell & Bryant
2017 on the use of spelling rules for plural nouns and third-person singular present
verbs in pseudowords; Street & Dąbrowska 2010, 2014 on passives and
quantifiers). Our experimental set-up, which includes multiple tasks executed by
the same participants, can also be used to investigate individual variation in
processing abstract patterns and constructions.
In conclusion, the results of this study demonstrate the importance of paying
attention to usage-based variation in research design and analyses – a
methodological refinement that follows from theoretical underpinnings and, in
turn, will contribute to a better understanding of language processing and
linguistic representations. Not only do groups of speakers differ significantly in
their behavior; an individual’s performance in one experiment is also shown to have
unique additional explanatory power regarding performance in another
experiment. This is in line with a conceptualization of language and linguistic
representations as inherently dynamic. Variation is ubiquitous, but, crucially, not
random. The task that we face when we want to arrive at accurate theories of
linguistic representation and processing is to define the factors that determine
the degrees of variation between individuals, and this requires going beyond
amalgamated data.
Chapter 5 Metalinguistic judgments are
psycholinguistic data
5.1 Introduction
Can metalinguistic judgments provide insight into the degree to which linguistic
constructions are entrenched in a speaker’s mind? A central tenet of usage-based
approaches is that language users are sensitive to the distributional properties of
the language they encounter and produce. These distributional properties affect
the way a linguistic item is represented mentally, which in turn affects the
probability that the item will be used, the speed with which it is processed, and
the speaker’s metalinguistic knowledge regarding its use. From this perspective,
degrees of entrenchment of linguistic units can be derived from processing data
as well as metalinguistic judgments. On the other hand, processing tasks and
judgment tasks may well differ in the processes and knowledge they tap into.
Various linguists have voiced the suspicion that entrenchment involves processes
which are too deeply embedded for introspection. In this chapter, we present data
that contribute to a better understanding of the relationships between
metalinguistic judgments, reaction time data, completion task responses, and
corpus frequencies, and we test assumptions that follow from a usage-based
approach.
5.2 The relationship between metalinguistic judgments and mental representations of language
Usage-based theories posit that mental representations of language emerge from
one’s experiences with language and general cognitive processes including
cross-modal association, categorization, chunking, and analogy (Bybee 2010). These
linguistic representations constitute a network of constructions that vary in size
and specificity and that are entrenched to different degrees. The more a
construction is established as a cognitive routine, the more it is said to be
entrenched (Langacker 1987). Usage frequency is a crucial factor in this respect;
more experience with a particular construction makes it more strongly
entrenched. As a result, the construction can be processed more quickly, fluently,
and accurately. Alegre and Gordon (1999), Arnon and Snider (2010), Bybee
(2002), and Dąbrowska (2008, 2018), among others, have demonstrated effects
of frequency on processing with regard to constructions ranging from
morphologically complex words and four-word phrases to (partially) schematic
constructions. A question that requires closer investigation is to what extent
mental representations are accessible to language users, such that the degree to
which linguistic constructions are entrenched in their minds manifests itself not
just in processing but also in metalinguistic judgments. In other words, are these
degrees of entrenchment part of one’s explicit knowledge and can metalinguistic
judgments be used to gain insight into entrenchment?
On the one hand, “judgments are the results of linguistic and cognitive
processes, by which people attempt to process sentences and then make
metalinguistic judgments on the results of those acts of processing (…) Thus,
they implicate the same linguistic representations involved in all acts of
processing”, as Branigan and Pickering (2017: 4) contend. On the other hand,
different kinds of linguistic activities –such as reading a text, rapidly making
choices in a lexical decision task, completing phrases, reading aloud words,
assigning familiarity ratings, making grammaticality judgments– involve different
aspects of linguistic representations and may differ in the degree to which they
appeal to particular mental representations. Judgments are said to be influenced
by knowledge and beliefs (Dąbrowska 2016a) and to reflect decision-making
biases (Branigan & Pickering 2017) which are not involved in language
processing. What is more, various researchers are concerned that introspections
cannot yield accurate insights into subconscious cognitive processes (e.g. Gibbs
2006; Roehr 2008; Stubbs 1993). To assess which aspects are accessible to
introspection, it is fruitful to compare metalinguistic judgments with processing
data. Such an approach answers Arppe et al.’s (2010: 4) call for more
multi-methodological research to gain a better understanding of the characteristics and
restrictions of each type of evidence.
Prior research has reported correlations between familiarity ratings for various
types of lexical units (i.e. words, word pairs, phrases, idioms, and metaphors) and
other measures that may provide information on degrees of entrenchment. More
specifically, these ratings have proved to be significant predictors of reading times
(e.g. Cronk et al. 1993; Juhasz & Rayner 2003; Williams & Morris 2004),
performance on lexical decision and speeded naming tasks (e.g. Gernsbacher
1984; Connine et al. 1990; Blasko & Connine 1993; Juhasz et al. 2015), speeded
semantic judgment tasks (e.g. Tabossi et al. 2009), and perceptual identification
tasks (Caldwell-Harris et al. 2012). While these findings are insightful, they are
limited in that the sets of familiarity ratings come from different people than the
datasets indicating performance in processing tasks. Multi-method approaches
are better able to provide valid insights into task-specific characteristics if they
account for individual differences. Speakers differ from each other in their
linguistic experiences; their linguistic representations are expected to differ
accordingly. If this is the case, we cannot tell whether a discrepancy between
familiarity judgments and processing data reflects the fact that different tasks tap
into different processes and knowledge, or whether it reflects individual variation
in linguistic representations. By having participants who are known to differ in
experience with a particular domain of language use, perform a metalinguistic
judgment task as well as psycholinguistic processing tasks, we can differentiate
between the two.
The goal of this study is two-fold. First, it will reveal to what extent differences
in amount of experience with a particular register manifest themselves in different
familiarity judgments when faced with word sequences that are characteristic of
that register. To this end, three groups of participants –recruiters, job-seekers, and
people not (yet) looking for a job– performed a metalinguistic judgment task in
which they assigned familiarity ratings to two sets of stimuli – word sequences
characteristic of either job ads or news reports. As the three groups differ in
experience in the domain of job hunting, they are likely to differ in experience with
collocations that are typically used in that domain. According to usage-based
theories, these differences in experience lead to differences in mental
representations of language. This leads to a testable hypothesis: If familiarity
judgments give expression to linguistic representations, the ratings should reflect
these differences. That is, the Job ad stimuli ought to be most familiar to the
Recruiters and least familiar to the Inexperienced participants.
Subsequently, we examine the relationship between metalinguistic judgments
and other types of experimental data. The stimuli that were presented in the
judgment task have also been used in two other experiments conducted among
the same participants: a completion task and a Voice Onset Time experiment
(both described in Chapter 4). By analyzing the judgment data in relation to the
participants’ completion task responses, their voice onset times, and corpus-based
frequencies, we can answer the second research question: To what extent
do someone’s own data from psycholinguistic processing tasks have explanatory
power in predicting familiarity judgments in addition to corpus frequencies? If the
different types of tasks tap into the same mental representations, one’s
performance in the processing tasks should be a significant predictor of one’s
familiarity ratings. If it does not prove to be a significant predictor, this means that
there are substantial differences between the tasks in the information they
provide.
5.3 Method
5.3.1 Participants
The same participants that took part in the completion task and the VOT
experiment (described in Chapter 4) performed this metalinguistic judgment task.
The sample consisted of 122 native speakers of Dutch who belonged to one of
three groups: Recruiters, Job-seekers, and Inexperienced participants. Section
4.2.1 provides further details regarding gender, age, educational background, and
group membership criteria.
5.3.2 Stimuli
The stimuli were the word sequences also used in the completion task and the
VOT experiment. They are described in detail in Chapter 4, Section 4.2.2. The set
consists of 35 word strings characteristic of job advertisements and 35 word
strings characteristic of news reports, covering a range of phrase frequency
values. The stimuli were randomized once for this task. The presentation order
was the same for all participants, to ensure that any differences between
participants’ judgments are not caused by differences in stimulus order.
5.3.3 Judgment task
Participants were asked to rate familiarity using Magnitude Estimation (Bard,
Robertson & Sorace 1996). This type of task was also used in the studies reported
in Chapters 2 and 3. Instead of using a set judgment scale, participants build their
own scale, making as many fine-grained distinctions as they feel appropriate (see
Section 2.2.3.2 for more information).
The concept ‘familiarity’ was defined in the following way: In this task, we ask
you to judge how familiar various word combinations are to you. For every word
combination, you enter a figure. The more familiar the word combination, the
higher the figure. You can think of familiarity in the following ways: you use it
often; you hear it often; you read it often (In deze taak vragen we u te beoordelen
hoe vertrouwd verschillende woordcombinaties voor u zijn. Bij elke
woordcombinatie vult u een getal in. Hoe vertrouwder de woordcombinatie, hoe
hoger het getal. Denk bij vertrouwdheid aan: u gebruikt het vaak; u hoort het vaak;
u leest het vaak.)
5.3.4 Procedure
Participants took part in the familiarity judgment task after they had finished the
completion task and the VOT experiment. All word combinations to be judged had
occurred in these other tasks. Studies like the one by Lilly (2009) suggest that
having seen all stimuli before making any judgments might actually be beneficial
(in terms of number of revisions, participants’ perception of the reliability of their
judgments, and the correlation between judgments and objective scores in
studies for which such scores exist).
As in the judgment tasks reported in Chapters 2 and 3, the items were
presented in an online questionnaire form (using the Qualtrics software program)
and this was also the environment within which the ratings were given.
Participants were introduced to the notion of relative ratings through the example
of comparing the size of depicted clouds and expressing this relationship in
numbers. They were instructed to rate each stimulus relative to the preceding one.
A new stimulus was always presented together with the preceding item and the
score assigned to it. In a brief practice session, participants then gave familiarity
ratings to verb–object combinations (e.g. een aardappel poffen ‘bake a potato’).
They were advised not to start very low, in order to allow for subsequent lower
ratings; not to assign negative numbers; and not to set an upper bound a priori.28
Before starting the main experiment, participants were informed that the word
combinations to be rated had already occurred in the previous tasks. They were
asked to assess their familiarity, leaving aside the occurrences in this study.
The first six stimuli covered the phrase frequency range of the entire set of
items, and the first as well as the seventh stimulus was taken from the middle
region of the frequency range, as this may stimulate sensitivity to differences
between items with moderate familiarity (Sprouse 2011). Midway, participants
were informed that they had completed half of the task and they were offered the
opportunity to fill in remarks and questions, just as they were at the end of the task.
5.3.5 Data transformations
For each participant, the ratings were converted to Z-scores to make comparisons
of relative ratings possible, just as we did in Chapters 2 and 3. A Z-score of 0
indicates that a particular item is judged by a participant to be of average
familiarity compared to the other items. Appendices 5.1 and 5.2 list the mean of
the Z-scores of all participants for a given item, and the standard deviation.
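The standardization can be sketched as follows. This is a minimal Python illustration of the procedure (the analyses in this dissertation were carried out in R, and the rating values below are invented):

```python
from statistics import mean, stdev

def standardize(ratings):
    """Convert one participant's raw Magnitude Estimation ratings to Z-scores.

    A Z-score of 0 indicates an item of average familiarity for that
    participant; positive scores indicate above-average familiarity.
    """
    m, sd = mean(ratings), stdev(ratings)
    return [(r - m) / sd for r in ratings]

# Two hypothetical participants who built very different scales: after
# standardization, their relative judgments become directly comparable.
p1 = standardize([10, 20, 30, 40, 50])
p2 = standardize([100, 200, 300, 400, 500])
print(all(abs(a - b) < 1e-9 for a, b in zip(p1, p2)))  # → True
```

Because each participant's ratings are centered and scaled on that participant's own distribution, a raw score of 50 on one scale and 500 on another can map to the same Z-score.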
5.3.6 Statistical analyses
First, we conducted an analysis of variance and planned contrasts to examine
whether there are significant differences in familiarity ratings across groups of
participants and sets of stimuli, analogous to the analyses of the completion task
data (Section 4.3.2) and the voice onset times (Section 4.4.2).
We then fitted linear mixed-effects models (Baayen et al. 2008), using the
lmer function from the lme4 package in R (version 3.3.3; CRAN project; R Core
Team, 2017), to the standardized familiarity ratings. We investigated to what
extent individual participants’ performance on other tasks predicts familiarity
ratings, on top of corpus frequencies. To determine this, we included three sets
of factors in the model. The first set consists of the corpus-based measures that
were also employed in the analyses of the judgment data in Chapters 2 and 3: phrase frequency, and lemma frequency of the final word in the phrase.29 The second set comprises two measures based on individual participants’ performance on the preceding experimental tasks involving the same stimuli: the measure TARGETMENTIONED, which is based on participants’ completion task responses, and the voice onset times from the VOT experiment. The third set comprises the factors PRESENTATIONORDER and BLOCK as artifacts of our experimental design. The predictor variables are discussed in turn below; the details of the modeling procedure are described in Appendix 5.3, and the datasets and the scripts are available in DataverseNL at https://hdl.handle.net/10411/EL6KZX.

28 The instructions and practice items are available in DataverseNL at https://hdl.handle.net/10411/EL6KZX.
The variable LOGFREQPHRASE is the log-transformed frequency with which the
phrase as a whole occurs in a subset of the Dutch web corpus NLCOW14 (Schäfer
& Bildhauer 2012) – a generic corpus, meant to reflect Dutch readers’ overall
experience, rather than one genre. The subset consisted of a random sample of 8
million sentences from NLCOW14, comprising in total 148 million words.
LOGFREQLEMMA is the log-transformed lemma frequency of the final word of the
phrase in the generic corpus.
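As an illustration of the transformation (a Python sketch; the log base and the handling of zero counts are assumptions made for illustration, not a restatement of the exact procedure used here):

```python
import math

def log_frequency(count):
    # Log-transforming raw corpus counts compresses the long-tailed
    # frequency distribution; the +1 guards against log(0) for phrases
    # unattested in the sample. (Base 10 and add-one smoothing are
    # illustrative assumptions.)
    return math.log10(count + 1)

print(log_frequency(0))    # → 0.0
print(log_frequency(999))  # → 3.0
```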
The variable TARGETMENTIONED expresses whether or not a participant was
known to expect the final word of a stimulus to occur given the preceding words.
For each stimulus, we assessed whether the final word (i.e. the target word) had
been mentioned by a participant in the completion task. We allowed for small
variations, provided that the words shared their morphological stem with the
target (e.g. info – informatie).
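The coding scheme can be sketched as follows. The helper below is hypothetical: the stem lookup stands in for the manual stem matching described above, and the example entries are invented:

```python
def target_mentioned(response, target, stem_map=None):
    """Binary coding: did the participant's completion-task response match
    the target word? Small variations sharing the target's morphological
    stem count as a match (e.g. 'info' for 'informatie').

    stem_map is a hypothetical lookup from word forms to stems; in the
    actual study the stem matching was done by hand."""
    stem_map = stem_map or {}
    stem = lambda w: stem_map.get(w, w)
    return stem(response.lower()) == stem(target.lower())

stems = {"info": "informatie", "informatie": "informatie"}
print(target_mentioned("info", "informatie", stems))    # → True
print(target_mentioned("nieuws", "informatie", stems))  # → False
```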
The variable VOT is based on the data from the voice onset time experiment.
It is the time it took a given participant to start pronouncing the target word as
soon as it appeared on screen following the cue (i.e. the stimulus with the final
word omitted).
29 In the analyses of the voice onset times (Chapter 4) we included the factor
GENERICSURPRISAL – a measure that is derived from corpus data. It is the surprisal
of the final word given the preceding words in the phrase, estimated by language
models trained on a generic corpus. Surprisal estimates are commonly used as a
measure of word predictability and processing speed. GENERICSURPRISAL is unlikely
to be a strong predictor of perceived familiarity of the phrase as a whole once
phrase frequency has already been taken into account. We checked whether
GENERICSURPRISAL would contribute to explaining variance in familiarity ratings.
Adding GENERICSURPRISAL to the model containing LOGFREQPHRASE did not improve
model fit (χ2(1) = 0.38, p = .54). Therefore, we omitted this factor in all subsequent
analyses.
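The model comparison in this footnote is a likelihood-ratio test between nested models. A minimal sketch of the computation (the log-likelihood values below are invented; they are merely chosen to reproduce a χ2(1) of 0.38):

```python
import math

def likelihood_ratio_test(loglik_null, loglik_full):
    """Compare two nested models: LR statistic and p-value (chi-square, df=1)."""
    lr = 2 * (loglik_full - loglik_null)
    # For df = 1, the chi-square survival function reduces to erfc(sqrt(x/2)).
    p = math.erfc(math.sqrt(lr / 2))
    return lr, p

# Illustrative values only: a predictor that improves the log-likelihood
# by a mere 0.19 yields chi2(1) = 0.38, p ≈ .54 -- no improvement in fit.
lr, p = likelihood_ratio_test(-1000.19, -1000.00)
print(round(lr, 2), round(p, 2))  # → 0.38 0.54
```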
Finally, we examined possible effects of PRESENTATIONORDER and BLOCK. As was
reported in Section 5.3.2, the stimuli were presented in a fixed order, the same for
all participants. Halfway through, there was a short break.
5.4 Results
5.4.1 Variation across groups of participants and sets of stimuli
Figure 5.1 visualizes the mean familiarity rating on the two types of items for each
participant individually. Figure 5.2 depicts for each participant the magnitude of
the difference between these two scores. These figures are based on
standardized ratings. The method of Magnitude Estimation entails that the raw
scores from different participants cannot be compared directly, as participants
each construct their own scale. Consequently, a score of 50 may represent an
average degree of familiarity for one participant, while it expresses a high degree
of familiarity for another participant. Once the ratings have been standardized,
they can be compared within and between participants.
The fact that the Recruiters display lower scores on the News report phrases
than the Inexperienced participants does not mean that these phrases are less
familiar to the Recruiters than to the Inexperienced participants in absolute terms.
It does mean that the Recruiters consider the Job ad phrases to be more familiar
than the News report phrases, while for the Inexperienced participants it is the
other way around. The vast majority of the Recruiters (90%) had higher
standardized ratings on the Job ad items than on the News report items.30 The
same holds for 77.5% of the Job-seekers and 11.9% of the Inexperienced
participants.
An analysis of variance performed on the difference scores depicted in Figure
5.2 showed that there is a significant effect of GROUP on the difference between
participants’ mean standardized familiarity rating on Job ad stimuli and their
mean standardized familiarity rating on the News report stimuli (F(2, 71.96) =
74.49, p < .001). Planned contrasts revealed that the difference scores are
significantly lower for Inexperienced participants compared to the other groups
(t(117.64) = 12.15, p < .001, r = .75), and that the Job-seekers’ difference scores
are in turn significantly lower than the Recruiters’ (t(77.36) = 2.55, p < .05, r = .28).31

30 There is one notable exception: the Recruiter represented by the turquoise line, whose mean standardized familiarity rating on the News report stimuli amounts to 0.42. We inspected the scores she assigned. She did not seem to have reversed the scale (i.e. assigning higher ratings to less familiar items), nor did she enter any comments indicating confusion or misunderstanding. The correlation between her standardized ratings and the average of all other participants’ standardized ratings is -.17 (Pearson’s r). After having analyzed the complete dataset, we also ran the model on the dataset without this participant’s data. Excluding her ratings did not alter any of the findings. We decided to keep her scores included, as such a deviant case may be a real characteristic of judgment data.

31 It was not appropriate to fit a linear mixed-effects model to the standardized familiarity ratings using GROUP, ITEMTYPE, and their interaction as fixed effects, as we did with the stereotypy scores (Section 4.3.2) and the voice onset times (Section 4.4.2). Such an analysis reveals significant differences across groups in ratings on the News report stimuli, suggesting that these phrases are significantly more familiar to the Inexperienced participants than to the Job-seekers and the Recruiters, while such a conclusion is not justified. Since we had to standardize the ratings, we cannot tell whether these phrases are more familiar to the Inexperienced participants than to the others in absolute terms (cf. Chapter 3, Section 3.5). What we can conclude is that the Inexperienced participants consider the News report stimuli to be more familiar than the Job ad stimuli, while for the Job-seekers and the Recruiters it is the other way around.

Figure 5.1 Mean standardized familiarity rating on the two types of stimuli for each individual participant.
Figure 5.2 The difference between the mean standardized familiarity rating on
Job ad stimuli and the mean standardized familiarity rating on the
News report stimuli for each individual participant; black bars show
each group’s mean difference. A circle below zero indicates that that
participant assigned higher ratings to News report stimuli than to Job
ad stimuli.
5.4.2 Corpus-based frequencies and participant-based psycholinguistic data as
predictors of familiarity ratings
Every data point in Figure 5.1 represents the average of the familiarity ratings a
participant assigned to 35 stimuli. These items were expected to vary in degree
of entrenchment and, consequently, the familiarity ratings were expected to vary
too. By means of mixed-effects models, we examined to what extent the variance
in ratings can be explained by corpus-based and participant-based measures, and
whether there are effects of presentation order and block. We incrementally added
predictors and assessed by means of likelihood ratio tests whether or not they
significantly contributed to explaining variance in ratings. A detailed description
of this model selection procedure can be found in Appendix 5.3. The corpus-based
measure phrase frequency contributed to the fit of the model; lemma frequency
did not, and therefore it was left out. TARGETMENTIONED – a measure of the
predictability of a target for a given participant – improved model fit, as did VOT –
the time it took the participant to start pronouncing the target word when
presented following the cue. Presentation order did not improve model fit, while
block did. None of the interactions of block and the other variables was found to
improve goodness-of-fit. The interactions of TARGETMENTIONED and the other
predictors were included as they did contribute to the fit of the model.

The resulting model is summarized in Tables 5.1 and 5.2. Table 5.1
presents the outcomes when Target not mentioned is used as the reference
condition. The intercept here represents the mean rating when the target had not
been mentioned by participants and all of the other predictors take their average
value. A predictor’s estimated coefficient indicates the change in ratings
associated with every unit increase in that predictor. The estimated coefficient of
LOGFREQPHRASE, for instance, indicates that, when the target had not been
mentioned and all other predictors take their average value, for every unit increase
in log-transformed phrase frequency, ratings are 0.33 higher. Table 5.2 presents
the effects of the predictors when the target had been mentioned in the
completion task.

Table 5.1 Generalized linear mixed-effects model (family: Gaussian) fitted to the standardized familiarity ratings, using Target not mentioned as the reference condition.

                          Estimate     SE        t     99 % CI
(Intercept)                  -0.24   0.07    -3.60     -0.41, -0.07  **
LogFreqPhrase                 0.33   0.05     6.65      0.20,  0.46  **
TargetMentioned=yes           0.45   0.03    13.87      0.37,  0.53  **
VOT                          -0.01   0.01    -1.14     -0.05,  0.02
Block=2                       0.23   0.10     2.35     -0.02,  0.48
LogFreqPhrase x TM=yes       -0.11   0.03    -3.94     -0.18, -0.04  **
VOT x TM=yes                 -0.04   0.02    -1.98     -0.10,  0.01
Note: Significance code: 0.01 ‘**’

Table 5.2 Generalized linear mixed-effects model (family: Gaussian) fitted to the standardized familiarity ratings, using Target mentioned as the reference condition.

                          Estimate     SE        t     99 % CI
(Intercept)                   0.21   0.07     2.97      0.03,  0.39  **
LogFreqPhrase                 0.22   0.05     4.26      0.08,  0.35  **
TargetMentioned=no           -0.45   0.03   -13.87     -0.53, -0.37  **
VOT                          -0.06   0.02    -3.36     -0.10, -0.01  **
Block=2                       0.23   0.10     2.35     -0.02,  0.48
LogFreqPhrase x TM=no         0.11   0.03     3.94      0.04,  0.18  **
VOT x TM=no                   0.04   0.02     1.98     -0.01,  0.10
Note: Significance code: 0.01 ‘**’
First of all, the model shows an effect of TARGETMENTIONED. When participants had
mentioned the target word in the completion task, the phrase was given
significantly higher familiarity ratings than when the target word had not been
mentioned.
In addition, phrase frequency (LOGFREQPHRASE) proved to have an effect.
Higher-frequency phrases were assigned higher familiarity ratings than lower-frequency phrases. This influence of frequency was significantly stronger when
the target word had not been mentioned, as is evidenced by the interaction
between LOGFREQPHRASE and TARGETMENTIONED. The effects of TARGETMENTIONED
and LOGFREQPHRASE, and the interaction, are visualized in Figure 5.3. All along the
frequency range, ratings were significantly higher when the target had been
mentioned by the participants in the preceding completion task. The effect of
TARGETMENTIONED is more pronounced for lower-frequency items (the distance
between the red and the blue line being larger on the left side than on the right
side). The effect of phrase frequency on familiarity ratings, with more frequent
phrases being assigned higher ratings, is indicated by the ascending lines. The
phrase frequency effect is significantly stronger when the target had not been
mentioned by participants (the red line being steeper than the blue line).

Figure 5.3 Scatterplot of the log-transformed corpus frequency of the phrase and the standardized familiarity ratings, split up according to whether or not the target word had been mentioned by a participant in the preceding completion task. Each circle represents one observation; the lines represent linear regression lines with 95% confidence intervals around them.
Finally, VOT proved to have an effect when the targets had been mentioned in
the completion task. In those cases, it was found that the less time it had taken
participants to start pronouncing the target word, the higher the familiarity ratings
that were assigned to the phrase. When participants had not mentioned the
targets in the completion task, by contrast, the corresponding voice onset times
did not predict ratings.
5.5 Discussion
The data presented in this chapter led to two main findings: differences in
experiences with a particular register are reflected in the familiarity ratings that
participants assign to phrases characteristic of that register; and individual
participants’ data from a completion task and a Voice Onset Time task are
significant predictors of the familiarity ratings they assign to the stimuli. These
findings have three important implications. First, they indicate that familiarity
judgments and other types of psycholinguistic data tap into the same mental
representations of language, and that familiarity ratings form useful data to gain
insight into these representations. Second, they provide support for usage-based
theories. Third, they add to a growing body of research that points to the
significance of individual differences and the merits of going beyond
amalgamated data.
In part, these implications follow from the analysis of the differences across
groups of participants. Given the differences between the groups in experience
with word sequences characteristic of job ads, a usage-based account predicts
differences in linguistic representations across groups. If familiarity judgments
give expression to linguistic representations, the ratings should reflect these
differences, just like the data from the completion task and the Voice Onset Time
experiment did. This prediction was borne out. The vast majority of the Recruiters
deemed the Job ad phrases to be more familiar than the News report phrases,
while for the Inexperienced participants it was the other way around. The Job-seekers are positioned in between the other groups. This pattern resembles the
patterns observed in the completion task data (Section 4.3.2) and the voice onset
times (Section 4.4.2): most Recruiters obtained a higher stereotypy score and
responded faster on Job ad stimuli than on News report stimuli; for the
Inexperienced participants it was exactly the other way around; and the Job-seekers took a middle position. Each of these three types of experimental data
thus points to usage-based variation in mental representations of multi-word
units.
The fact that the judgment data revealed the same patterns we observed in
the data from psycholinguistic processing tasks is in line with findings from prior
research that related familiarity ratings to processing times (e.g. Blasko & Connine
1993; Caldwell-Harris et al. 2012; Juhasz & Rayner 2003; Juhasz et al. 2015;
Tabossi et al. 2009; Williams & Morris 2004). While insightful, these analyses are
limited in that they average across participants. In prior research, the judgment
tasks and the processing tasks were conducted with different participants. As a
result, one cannot distinguish differences across tasks from individual differences
which are stable across tasks. By having the same participants perform a
metalinguistic judgment task as well as processing tasks and analyzing the data
at the level of individual participants, we are able to account for individual
differences which are stable across tasks.
We fitted mixed-effects models to the full set of ratings to examine to what
extent the variance in ratings can be explained by corpus frequencies and
participant-based psycholinguistic data. If the familiarity ratings index the extent
and type of previous experience participants have had with the stimulus, then
corpus frequencies ought to be a significant predictor of those ratings as they
capture variation across items in frequency of occurrence. Corpus-based
measures were not expected to explain the variance fully, though, since the corpus
is merely a rough approximation of the participants’ experiences. Participant-based measures were hypothesized to have additional explanatory power, as they
can account for individual differences. If expectations (recorded in the completion
task), processing speed (measured in the VOT experiment), and familiarity
judgments all reflect the degree to which linguistic units are entrenched in a
speaker’s mind, then a participant’s psycholinguistic responses from the first two
tasks can predict that person’s familiarity ratings.
As hypothesized, data elicited from an individual participant in other
psycholinguistic tasks using the same stimuli constituted a powerful predictor for
that person’s familiarity judgments. Not surprisingly, corpus-based phrase
frequency (LOGFREQPHRASE) exerted influence, with more frequent phrases being
assigned higher ratings. But on top of that, completion task responses and voice
onset times were significant predictors. When participants had mentioned the
target word in the completion task, they gave significantly higher familiarity
ratings to the phrase than when they had not mentioned it. Furthermore, when
participants had predicted the final word of a stimulus to occur given the
preceding words, corpus-based phrase frequency exerted less influence on their
familiarity ratings. In that case, their own voice onset times formed a significant
predictor; the faster they had read aloud the target word in the VOT task, the higher
the ratings they assigned.
The fact that participants’ data from psycholinguistic processing tasks
constitute significant predictors of their familiarity judgments when corpus-based
frequencies have already been added to the statistical model is not self-evident.
The different tasks each have their own characteristics and restrictions. Voice
onset times are more susceptible to noise (e.g. effects of a sneeze or a lapse of
attention) than completion task responses or familiarity judgments expressed
without time constraints. TARGETMENTIONED, being a binary measure of the
predictability of a word, cannot account for gradient differences in entrenchment,
while the other two measures can. Metalinguistic judgments, in turn, are said to
reflect beliefs and decision-making biases that do not affect online processing, or
at least much less so. Nonetheless, there were significant relationships between
the different types of data. To be sure, there is variance that is unexplained, and
follow-up studies can contribute to a better understanding of the ways in which
the tasks differ from each other. Still, the statistical relationships between
familiarity ratings and other types of psycholinguistic data as well as corpus
frequencies may remove some of the doubts about the usefulness of
metalinguistic judgments, at least in investigations of mental representations of
multi-word units. The relationships suggest that the different tasks tap into the
same linguistic representations. Metalinguistic judgments can be considered
psycholinguistic data, just like voice onset times and completion task responses,
and they can be as useful as other types of psycholinguistic data to gain insight
into linguistic representations.
Our findings showcase the added value of collecting different types of data
from the same participants. This practice yields more insight into individual
differences and variation across different measures that aim to tap into degree of
entrenchment. When researchers work with data sets from different speakers,
they are not able to tell to what extent variation is to be ascribed to task-related
differences on the one hand, and individual differences in linguistic experience and
cognitive abilities on the other. We urge other researchers to conduct multiple
tasks among the same speakers, as this will advance our understanding of the
cognitive and experiential underpinnings of mental representations of language
and the ways in which these representations manifest themselves in various
linguistic activities.
Chapter 6
Abstract
Chapters 2 through 5 reported on multi-method studies that involved corpus data
as well as offline and online experimental data. What the outcomes imply for
theories of mental representations of language is discussed in Chapter 7. The
present chapter focuses on the methodological lessons that can be learned from
our studies. This chapter highlights the merits of multi-method research in
linguistics and may help designing such research. It discusses methodological
and practical concerns in the selection of corpus data, metrics to analyze corpus
data, stimuli, experimental tasks, and participants, using the studies reported in
the previous chapters as case studies.
This chapter is based on:
Verhagen, V., Mos, M., Backus, A. & Schilperoord, J. (submitted). A concise guide
to the design of multi-method studies in linguistics: Combining corpus-based
measures with offline and online experimental data. Dutch Journal of Applied
Linguistics. Manuscript submitted for publication.
A concise guide to the design of multi-method studies in linguistics: Combining corpus-based measures with offline and online experimental data
6.1 Introduction
It is worthwhile to make use of different types of data in linguistic research. This
may involve combining data from different sources, using different methods, and
integrating quantitative and qualitative approaches, each of which contributes to
triangulation (Bryman 2004; Hammersley 2008). Different kinds of data can
complement each other, thus yielding a fuller and more precise picture of the
phenomena under investigation and more insight into the characteristics and
limitations of distinct types of data.
Properly designing and conducting a multi-method study requires knowledge from
a variety of domains. This chapter presents an overview of the steps to be taken
and the decisions to be made. It illustrates this using the studies described in
Chapters 2 to 5 as case studies and provides references to useful handbooks and
best practices.
In all of our studies, we investigated the relationship between corpus data and
experimental data. Corpus data can complement experimental data, for example
with respect to ecological validity, contextualization, and scope. Additionally, a
comparison of experimental data and corpus data may be used to assess the
representativeness of a corpus for particular language users and to test
hypotheses regarding the way people process particular linguistic constructions,
formulated on the basis of the distributional patterns in a corpus.
Additionally, in Chapters 4 and 5, we examined the relationship between
different types of experimental data, by analyzing to what extent performance on
one task is a significant predictor of performance on another task. Such analyses
enhance our understanding of the extent to which different types of data rely on
the same linguistic representations, and how they complement each other.
Moreover, we showed the added value of conducting multiple tasks with the same
participants – a methodological approach which is, as yet, seldom used in multi-method studies. It makes it possible to distinguish variation across tasks, on the
one hand, from variation between participants which is stable across tasks, on
the other.
It is not just the combination of different kinds of data that may yield a more
complete picture; multiple measurements using the same method can also
contribute to more accurate conclusions. In the studies described in Chapters 2
and 3, participants performed the same task twice. We examined the test-retest
reliability of metalinguistic judgments and we gained insight into the degree of
intra-individual variation relative to inter-individual differences. If the intra-individual variation from one moment to the other reflects the genuine dynamism
of linguistic representations, multiple measurements are required to describe this.
Chapters 2 and 3 also yielded insight into the extent to which the outcomes of
experiments depend on choices like the type of scale that is used (a 7-point Likert
scale or a Magnitude Estimation scale) and whether stimuli are presented in a
sentential context or as an isolated word string.
The insights from these studies are used here to discuss the following steps
in multi-method research design: selecting corpus data, metrics to analyze corpus
data, and stimuli, selecting and designing experimental tasks,32 and selecting
participants. Each section concludes with a text box which summarizes the most
important considerations, together with useful references.
6.2 Steps in the multi-method approach
6.2.1 Selecting corpus data
A corpus is a collection of texts that can be analyzed for various purposes. It
enables you to gain insight into natural language use, obtain large numbers of
instances of a linguistic construction (more than is possible via introspection or
elicitation) and examine distributional patterns. Such information is of great value,
not just in descriptive linguistics; it can serve as a basis to formulate hypotheses
on linguistic representations and language processing, and to select items to be
used in experiments.
One’s research interests determine which kinds of data are suitable. There is
a wide variety of corpus types (see Lüdeling & Kytö 2008 for an overview),
differing in terms of size; medium (e.g. written text, transcribed spoken text, audio,
video); the time periods that are covered by the texts; the availability of
annotations (e.g. part-of-speech tags, lemmatization) and metadata (e.g. text
type, information on the writers/speakers); the ways in which the compilers
strove to make the corpus representative of a particular language, variety, or
register, and balanced such that the proportional sizes of the corpus parts are
similar to those in the language, variety, or register. There are no straightforward
rules on how to compile or select a good corpus; it greatly depends on your
research goal. That is not to say that there are no guidelines (see text box 1).
32 While acknowledging the merits of other methods, such as ethnography (Levon
2013), interviewing (Schilling 2013), and computational modeling (Pearl 2010),
we limit our discussion to experiments.
In our studies, we used existing corpora (i.e. Corpus Gesproken Nederlands,
SoNaR, Twente Nieuws Corpus, and NLCOW14), as well as a corpus consisting
of Dutch job advertisements that was composed for the purpose of our study. A
Dutch job ad corpus did not yet exist. Textkernel, a company that specializes in
information extraction, web mining and semantic searching and matching in the
Human Resources sector, created one for us. One of its software modules
automatically searches the Internet for new job ads every day. All the job ads
retrieved in the year 2011 (slightly over 1.36 million) were compiled, yielding a
corpus of 488.41 million words. The past decades have seen the development of
software tools and programming languages that make it easier to create corpora
and parse and tag the data (see Gries & Newman 2013). It should be noted,
though, that a corpus must be constructed carefully for it to be representative.
When building a corpus to be used in multi-method research, it may be possible
to collect texts that have been produced (or processed) by the people who will
also take part in the experiments. The corpora we used consist of data that our
participants had not written themselves, nor had they read all of those texts. Still,
the corpora can approximate their linguistic experiences. It is worthwhile to
examine the possibility of compiling a corpus from texts that are produced by the
participants themselves. Such a corpus makes it possible to tailor experimental
stimuli to a participant’s own language use (see, for example, Barking et al.
submitted) and to compare how well corpus-based measures derived from either
amalgamated or personal data correlate with participants' experimental data.
Chapter 6
Text box 1. Considerations regarding the selection of corpus data.
What do you want to use the corpus for?
(E.g. to establish characteristics of a particular text type or register, to
discover characteristics of particular linguistic constructions, to
compare corpus data to experimental data)
What kind of information should the corpus contain?
- Annotations (e.g. part-of-speech tags, lemmatization, phonological
annotations)?
- Metadata regarding the texts and/or the authors (e.g. date and place
of publication, text type, the writer/speaker’s gender, age and
nationality)?
Is there an existing corpus that meets your requirements?
Consult overviews such as:
- Lüdeling and Kytö (2008)
- http://corpus.byu.edu/
- http://martinweisser.org/corpora_site/CBLLinks.html
- http://www.inl.nl/taalmaterialen#corpora
Or should you compile one?
Gries and Newman (2013) give useful advice on how to collect and
prepare corpus data.
Take into account copyright and privacy issues (see Treadwell 2017
Chapter 3 on collecting Internet content). Check whether your research
institute and/or the venue for publishing your work require a research
ethics committee to approve the data collection.
6.2.2 Corpus analysis
Once the corpus data have been selected, you need to decide how to extract
information from them. There is an overwhelming number of ways to analyze corpus
data and there are ongoing debates (e.g. Bybee 2010; Gries 2012; Schmid &
Küchenhoff 2013) as to what metric is most suitable given a particular goal (e.g.
do you aim to gauge the predictability of a linguistic unit, its conventionality, its
degree of entrenchment out of context, the mutual attraction of lexemes and
constructions, or the productivity of a construction?). All corpus metrics concern
distributions of some kind. Gries and Newman (2013) distinguish three types of
distributions of linguistic units: frequencies and dispersion (i.e. how often and
where does something occur in a corpus); collocations (i.e. how often do linguistic
units occur in close proximity to other linguistic elements); concordances (i.e.
how are linguistic units used in their actual contexts, ranging from a few words to
whole sentences).
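For illustration, a frequency-and-dispersion measure can be computed in a few lines. The sketch below implements the deviation of proportions (DP) discussed by Gries (2008); all corpus-part sizes and counts are invented for the example.

```python
# Sketch of a simple dispersion measure: DP (Gries 2008).
# For each corpus part, compare the proportion of the corpus it makes up
# (expected) with the proportion of the target's tokens found there
# (observed). 0 = perfectly even spread; values near 1 = heavily clumped.

def dp(part_sizes, part_counts):
    """part_sizes: words per corpus part; part_counts: target hits per part."""
    total_size = sum(part_sizes)
    total_count = sum(part_counts)
    expected = [s / total_size for s in part_sizes]
    observed = [c / total_count for c in part_counts]
    return 0.5 * sum(abs(e - o) for e, o in zip(expected, observed))

# Three equally sized parts:
print(dp([1000, 1000, 1000], [10, 10, 10]))  # evenly dispersed: 0.0
print(dp([1000, 1000, 1000], [30, 0, 0]))    # all hits in one part: ~0.67
```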
The choice of metrics is motivated by what the corpus-based measures ought
to capture, and what the subject of inquiry and the data allow for. If you examine
the strength of association between particular words, for instance, you have a
choice between unidirectional (either →, or ←) or bidirectional (↔) measures.
This choice matters in particular when the strength of association is asymmetric,
meaning that one word is more predictive of the other than the other way around
(e.g. the word course is often preceded by of, while there are many different words
that tend to follow of). When using a corpus-based association measure to predict
word-by-word self-paced reading times, a unidirectional measure from left to right
(e.g. how predictable is ‘president’ given the word ‘vice’, without taking into
account the extent to which ‘president’ is predictive of ‘vice’) may be most in line with
the way participants process the language.
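The asymmetry described above can be made concrete with a toy computation. The counts below are invented for illustration; they contrast a unidirectional forward probability, its backward counterpart, and a bidirectional measure (pointwise mutual information).

```python
import math

# Toy counts illustrating asymmetric association ("of course"):
# the bigram is frequent, "course" is rare overall, "of" is very common.
N = 1_000_000        # corpus size in tokens
freq_w1 = 30_000     # "of"
freq_w2 = 120        # "course"
freq_bigram = 100    # "of course"

p_forward = freq_bigram / freq_w1    # P(course | of)
p_backward = freq_bigram / freq_w2   # P(of | course)
pmi = math.log2((freq_bigram / N) / ((freq_w1 / N) * (freq_w2 / N)))

print(p_forward)   # low: "of" hardly predicts "course"
print(p_backward)  # high: "course" strongly predicts a preceding "of"
print(pmi)         # positive: the two attract each other overall
```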
It is important to take into consideration that the characteristics of the corpus
data and the linguistic constructions of interest may constrain the options.
Collostructional analysis (Stefanowitsch & Gries 2003), for instance, is a useful
method to analyze the distribution of lexemes in alternating grammatical
structures, a common example being the dative alternation. In the case of the
dative alternation, collostructional analysis assesses the degree to which
particular verbs are attracted to either the prepositional dative (e.g. she gave the
book to him) or the ditransitive construction (she gave him the book). This
analysis requires determining the frequency with which the target verb (e.g. give)
occurs in the target construction (e.g. the prepositional dative), the frequency with
which all other verbs occur in the target construction, and the frequency with
which the target verb occurs in other constructions. Crucially, in some cases, it
may not be possible to define and trace “all other constructions” (this has been
called the cell no. 4 problem, see Schmid & Küchenhoff 2013; Bybee 2010 p.98).
Apart from the question whether a count of the number of (inflected) verbs can
be considered a good proxy for this, you may be faced with the problem that the
corpus is not tagged accordingly. In that case, there may be an appropriate tagger
available or it might be possible to write a script that can identify and classify
relevant parts of speech.
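As a minimal sketch of the 2×2 table underlying such an analysis: the cell counts below are hypothetical, and a log-likelihood (G²) score is used here as the association measure. Note that Stefanowitsch and Gries themselves use the Fisher-Yates exact test; see Schmid and Küchenhoff (2013) on the choice of measure.

```python
import math

def g2(a, b, c, d):
    """Log-likelihood (G2) over the 2x2 collostruction table:
       a = target verb in target construction,
       b = other verbs in target construction,
       c = target verb in other constructions,
       d = other verbs in other constructions."""
    total = a + b + c + d
    g = 0.0
    for obs, row, col in ((a, a + b, a + c), (b, a + b, b + d),
                          (c, c + d, a + c), (d, c + d, b + d)):
        expected = row * col / total
        if obs > 0:
            g += obs * math.log(obs / expected)
    return 2 * g

# Hypothetical counts for a verb strongly attracted to a construction:
print(round(g2(a=50, b=200, c=100, d=10_000), 1))  # large value = strong association
```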
After the selection of metrics, there are usually more decisions to be made. It
may be necessary to determine what you consider to be instances of the same
construction (e.g. spelling differences like color and colour; contracted forms like
haven’t you and have you not; different forms of the same lemma like info and
information or has, had, and having). Furthermore, the window around the target
item –that is, the amount of context that is taken into account– needs to be decided
on, as well as whether to allow for intervening words (e.g. allowing for een sterk
analytisch en probleemoplossend vermogen ‘strong analytical and problem
solving skills‘ to be retrieved when searching for een sterk analytisch vermogen).
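Allowing for intervening words can be implemented with a simple pattern. The sketch below, with invented example sentences, permits up to two intervening words between analytisch and vermogen.

```python
import re

# Allow up to two intervening words between "analytisch" and "vermogen",
# so that "een sterk analytisch en probleemoplossend vermogen" also matches.
pattern = re.compile(r"\been sterk analytisch(?: \S+){0,2} vermogen\b")

texts = [
    "wij zoeken een sterk analytisch vermogen",
    "wij zoeken een sterk analytisch en probleemoplossend vermogen",
    "een sterk analytisch team",
]
hits = [t for t in texts if pattern.search(t)]
print(len(hits))  # the first two sentences match, the third does not
```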
In addition, it may be important to distinguish between homonyms or different
senses of a polysemous word, especially when a query targets a single word. For
example, the noun vermogen can mean ‘property’, ‘fortune’, ‘capital’, ‘power’,
‘ability’. If some of these uses are irrelevant given the research question, it may
be useful to employ word sense disambiguation tools (e.g. WordNet, Princeton
University 2010; Agirre & Edmonds 2007).
Finally, when the queries have yielded results, certain transformations may be
required. Many metrics are subject to sample-size effects. For example, type-token ratio –a measure of lexical diversity– is known to be affected by text length,
with longer texts yielding lower TTR values. If the text segments to be compared
in terms of lexical diversity are not of equal sizes, an adjusted score like the mean
segmental type-token ratio or the measure of textual lexical diversity can be used
(these measures hold either the sample size or TTR constant, see Jarvis 2013).
To compare simple frequencies of occurrence of a linguistic construction across
(sub)corpora of different sizes, they are normalized as a ratio of occurrences per
million words. If the frequencies are to be used as a predictor of experimental data
such as processing speed or metalinguistic judgment, it is common to log-transform them, since the relationship between frequency and learning,
recognition, and production has been shown to be logarithmic rather than linear
(Baayen 2001; Howes & Solomon 1951; Tryk 1986).
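Two of these transformations can be sketched as follows; the segment size, the +1 smoothing constant, and the frequency figures are illustrative choices, not the exact settings of our studies.

```python
import math

def msttr(tokens, segment_size=100):
    """Mean segmental type-token ratio: average TTR over fixed-size,
    non-overlapping segments, discarding an incomplete final segment."""
    ratios = []
    for start in range(0, len(tokens) - segment_size + 1, segment_size):
        segment = tokens[start:start + segment_size]
        ratios.append(len(set(segment)) / segment_size)
    return sum(ratios) / len(ratios)

def per_million(raw_freq, corpus_size):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_freq / corpus_size * 1_000_000

# A string occurring 1,250 times in a 488.41-million-word corpus:
norm = per_million(1250, 488_410_000)
log_freq = math.log10(norm + 1)  # +1 guards against log10(0)
print(round(norm, 2), round(log_freq, 2))
```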
To illustrate how research goals and the characteristics of both corpus data
and experimental design direct the analysis of the corpus data, the study
described in Chapters 4 and 5 can serve as an example. In that study, we were
looking for a metric that quantifies the extent to which a word string is
characteristic of job advertisements or news reports. We made use of the log-likelihood statistic, following the frequency profiling method of Rayson and
Garside (2000). This method enabled us to discover features in the corpora that
distinguish one corpus from the other (Kilgarriff 2001). It identifies n-grams
whose occurrence frequency is statistically higher in one corpus than another,
thus appearing to be characteristic of the former. In our comparison of the Job
ad corpus and the TwNC, we focused on n-grams ranging in length from three to
ten words. In order to bypass an enormous number of irrelevant strings such as
Contract Soort Contract (‘Contract Type of Contract’), which occur in the headers
of the job ads, we applied the criterion that a string had to occur at least ten times
in one corpus and two times in the other. As the irrelevant strings do not occur in
the TwNC, they are ignored when using this criterion. In this way, we obtained two
lists: one containing n-grams that appear to be most typical of job ads, and one
containing n-grams that appear to be most typical of news reports.
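The log-likelihood statistic for comparing an item's frequency across two corpora can be computed as follows (after Rayson & Garside 2000); the frequencies and corpus sizes below are invented for illustration.

```python
import math

def log_likelihood(freq1, freq2, size1, size2):
    """Keyness of an item across two corpora (Rayson & Garside 2000).
    freq1/freq2: observed occurrences; size1/size2: corpus sizes in words.
    Expected frequencies assume the item is distributed proportionally."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

# A hypothetical n-gram: frequent in job ads, rare in news text.
print(round(log_likelihood(500, 20, 10_000_000, 10_000_000), 1))
```

A large value indicates that the item's frequency differs more from the proportional expectation, i.e. that it is more characteristic of one of the two corpora.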
We then determined corpus-based string frequency, lemma frequency of the
final word, and the degree of unexpectedness of the final word given the preceding
words (i.e. surprisal), as these features were expected to affect the way in which
the words are processed and judged in our experiments (see Section 6.2.4). Our
stimuli are composed of a cue (e.g. goede contactuele … ‘good
communication …’) and a target word (e.g. eigenschappen ‘skills’). The words
constituting the cue were presented all at once, and we did not record the speed
with which the component parts were processed. Therefore, we did not use
corpus-based measures that analyze the internal structure of the cue, and we
calculated the surprisal of the target word given the cue as a whole. To obtain
surprisal estimates, language models were trained on the generic corpus. A 7-gram model was used, since the length of our word strings did not exceed seven
words.
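A minimal sketch of how surprisal is derived from n-gram counts; the counts below are invented, and a real model would of course be estimated from full corpus data with appropriate smoothing.

```python
import math

# Surprisal of a target word given a cue: -log2 P(target | cue).
# Hypothetical counts for the cue "goede contactuele ...":
cue_count = 400
joint_counts = {"eigenschappen": 350, "vaardigheden": 45, "start": 5}

def surprisal(target):
    return -math.log2(joint_counts[target] / cue_count)

print(round(surprisal("eigenschappen"), 2))  # expected completion: low surprisal
print(round(surprisal("start"), 2))          # unexpected completion: high surprisal
```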
Text box 2. Considerations regarding the analysis of corpus data.
What do you want your corpus-based measure to reveal?
Do your corpus data impose restrictions (e.g. lack of particular kinds of
annotations or metadata)?
What type of metrics is most suitable? (Gries 2010a)
(i) Frequencies and dispersion (i.e. how often and where does something
occur in a corpus)
(Gries 2008)
(ii) Collocations (i.e. how often do linguistic units occur in close
proximity to other linguistic elements)
(Wiechmann 2008; Divjak 2016)
(iii) Concordances (i.e. how are linguistic elements used in their actual
contexts)
(Sinclair 1991; Gries 2010b)
Can you make use of an interface in which corpus analysis tools are
integrated? For example:
- https://portal.clarin.inl.nl/opensonar_whitelab/page/search
- http://liwc.wpengine.com/
- http://www.lexically.net/downloads/version5/HTML/, http://lexically.net/wordsmith/step_by_step_Dutch6/index.html?introduction.htm
Which variants of a construction do you want to include or exclude (e.g.
spelling differences, contracted forms)?
Are transformations required? If so, which? (Gries 2010a)
6.2.3 Selecting stimuli
In the studies presented in this book, the selection of stimuli for experimental
research was based to a large extent on corpus analyses. Analyses of corpus data
often play a role in this phase in multi-method research, as such data provide
information about characteristics of the linguistic items (e.g. frequency of use,
collostructional strength, prototypicality, predictability) that can be used to
identify suitable items and predict the way they are processed or rated in
experimental tasks.
In the selection of stimuli for experimental research, three main considerations
play a role: What is it that the stimuli ought to represent? Which factors ought to
be controlled for? How many items are required? Say you want to conduct an eye
tracking experiment to examine whether abstract words are processed more
slowly than concrete words. Word length and word frequency are known to affect
the time it takes to process a word (e.g. Balota, Cortese, Sergent-Marshall, Spieler
& Yap 2004), but such effects are not of interest to you. Therefore, you have to
control for them. While it may be hard to find sufficient suitable stimuli that do
not differ at all in length or frequency, it may be feasible to apply pairwise
matching –finding a matched control word for every stimulus (i.e. two items that are
alike in length and frequency, yet differ in concreteness)– or to account for length
and frequency effects in the analyses by including these factors as covariates.
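Pairwise matching can be sketched as a greedy search for the closest unused control item; all words, lengths, and log frequencies below are invented for illustration.

```python
# Greedy pairwise matching: for each abstract word, pick the closest
# still-available concrete word in terms of length and log frequency.
# Tuples are (word, length, log frequency); values are invented.
abstract = [("idea", 4, 5.1), ("truth", 5, 4.8)]
concrete = [("desk", 4, 5.0), ("chair", 5, 4.9), ("spoon", 5, 3.2)]

def distance(a, b):
    # Sum of the length difference and the log-frequency difference.
    return abs(a[1] - b[1]) + abs(a[2] - b[2])

pairs = []
available = list(concrete)
for item in abstract:
    best = min(available, key=lambda c: distance(item, c))
    available.remove(best)  # each control word is used only once
    pairs.append((item[0], best[0]))

print(pairs)  # [('idea', 'desk'), ('truth', 'chair')]
```

Note that greedy matching is order-dependent; for larger stimulus sets, optimal matching procedures may yield better-balanced pairs.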
Usually, stimuli constitute a sample, just like participants do. In the case of the
concreteness study, the stimuli do not exhaust all possible words in a given
language, and the participants do not constitute all speakers of that language.
Still, researchers intend to generalize to a population, namely to words beyond the
items included in the stimuli set, and to language users beyond the actual people
participating. To obtain replicable results that generalize across participant as well
as stimulus samples, both sample sizes need to be sufficiently large (see Westfall,
Kenny & Judd 2014 for practical tools and guidance). It is important to realize
that a suboptimal sample of stimuli can hardly be compensated for by recruiting
more participants.
In our selection of 35 Job ad stimuli and 35 News report stimuli for the study
reported on in Chapters 4 and 5, we used the following criteria. The word strings
had to end in a noun and they had to be comprehensible out of context. We only
included n-grams that constitute a phrase (more specifically, a noun phrase, a
prepositional phrase, or an adjective phrase). It is not clear whether ‘phrasehood’
could have an effect (cf. Arnon & Cohen Priva 2013; Tily et al. 2009). We decided
to use only phrases, because we presented the items as isolated word strings.
Processing and judging a word string in isolation is less natural for non-constituents than for constituents (compare, for example, as far as I to at the very
last moment).
The word strings were to cover a range of values on two types of corpus-based
measures: string frequency and surprisal of the final word in the string, as we
aimed to investigate how these variables affect processing and familiarity ratings.
Finally, strings were chosen in such a way that in the final set of stimuli all content
words occur only once. The stimuli vary in terms of length and frequency of the
final word; we included those factors in the analyses.
Text box 3. Considerations regarding the selection of stimuli.
What are the categories or ranges that your stimuli ought to cover? See
Cohen (1990) on ‘less is more’ concerning dependent and independent
variables.
Which variables will you control for in stimulus selection and/or analyses?
- Examples of variables to manipulate or control for: (Baayen 2010)
- Databases with norms and ratings for the purpose of stimulus
selection: (Keuleers & Balota 2015 for an overview; Juhasz, Lai &
Woodcock 2015; McRae, Cree, Seidenberg & McNorgan 2005; Nelson,
McEvoy & Schreiber 1998)
Are artificial stimuli required?
Nonwords
- Example of research using Dutch words and nonwords: (Keuleers,
Diependaele & Brysbaert 2010)
- Examples of nonword generators and databases: Wuggy (Keuleers &
Brysbaert 2010; http://crr.ugent.be/programs-data/wuggy), the English
Lexicon Project (Balota et al. 2007, http://elexicon.wustl.edu/), ARC
Nonword Database (Rastle, Harrington & Coltheart 2002)
Artificial language
- Examples: Misyak and Christiansen (2012); Van den Bemd, Mos,
Alishahi, and Shayan (2014)
What is the appropriate sample size? (Westfall et al. 2014)
6.2.4 Selecting and designing experimental tasks
There is a whole range of experimental methods to choose from, differing on
several dimensions. They vary in terms of the modality in which stimuli are
presented or produced (e.g. visual, auditory) and whether they involve language
comprehension, production, and/or judgment. Furthermore, methods can be
characterized as more online or more offline, the former meaning that the method
taps into real-time aspects of language processing (e.g. eye tracking), the latter
that it assesses the outcomes of this process (e.g. how participants interpret or
judge a sentence). Experiments can also be classified as yielding quantitative
and/or qualitative data. In addition, experiments differ as to how natural the
stimuli are, whether participants are to do something they normally do not do, and
how natural the circumstances are in which the task is performed. This has
implications for the ecological validity of the study. Self-paced reading (SPR)
using a word-by-word moving window, for example, can be considered fairly
unnatural. Usually, a sentence is not presented to us one word at a time, and
during natural reading we can backtrack and look ahead, while in SPR this is not
possible.
Since each type of experiment has its advantages and disadvantages, there is
clear added value in combining different types. They can complement each other
and thus offer a more complete picture of the subject of investigation (for more
elaborate considerations see Arppe, Gilquin, Glynn, Hilpert & Zeschel 2010;
Schönefeld 2011). If possible, conduct different experiments with the same
participants. When you compare data from tasks conducted with different
participants, you are faced with individual differences as well as task-specific
contributions to the effects you want to investigate (Connine, Mullennix, Shernoff
& Yelen 1990; Chapters 4 and 5).
When participants are to perform a series of tasks, researchers should
consider the order carefully, taking into account possible carry-over effects
(Myers, Well & Lorch 2010). If they intend to measure effects of surprisal on
processing speed, participants should not have seen the target items before. By
contrast, if participants are to rate the stimuli, it might actually be beneficial if they
have seen all stimuli before making any judgments. In that case, participants have
been found to make fewer and smaller revisions when offered the opportunity,
and their ratings most closely matched objective scores in studies for which such
scores existed (Lilly 2009).
In our last study (Chapters 4 and 5), participants performed a series of tasks
in one session. In the completion task, they read aloud the stimuli from which the
final word had been omitted (e.g. the cue een vliegende … ‘a flying …’) and
completed them by naming all appropriate complements that came to mind
within five seconds. After this first task, participants were given a questionnaire
regarding demographic variables and two short attention-demanding arithmetic
tasks. These small tasks distracted them from the word strings that they had
encountered in the completion task and were about to see again in the voice onset
time (VOT) experiment. At the same time, the tasks prepared them for the
judgment task, illustrating the method of magnitude estimation by which
participants build their own scale (Bard, Robertson & Sorace 1996).
In the VOT experiment, the participants were presented with the same cues,
followed by a particular target word (e.g. start ‘start’), which they had to read
aloud as quickly as possible. We measured the time it took to recognize and
pronounce a particular word following a given word sequence. The 70 stimuli were
mixed with 17 filler items, which were of the same type as the experimental items
(i.e. (preposition) (article) adjective noun), but consisted of words unrelated to
these items (e.g. het prachtige uitzicht ‘the beautiful view’). The fillers were new
to the participants and made the task a bit more varied. The fixation mark that
signaled the start of a new trial was displayed on the screen with varying
durations, to prevent participants from getting into a fixed rhythm.
Finally, in the judgment task, participants rated how familiar the 70 word
strings were to them using Magnitude Estimation (ME). We opted for a judgment
task in which participants constructed their own scale, rather than offering a
binary or Likert-type fixed set of rating options. In the study presented in Chapter
3, we compared familiarity judgments expressed on a 7-point Likert scale and a
Magnitude Estimation scale. The two types of ratings did not differ significantly;
both showed a significant effect of phrase frequency (i.e. higher phrase frequency
led to higher familiarity ratings, as expected); and there was a near perfect
Time1–Time2 correlation of the mean ratings in all experimental conditions. Still,
there are some differences worth considering when selecting a particular scale. A
Likert scale, unlike a ME scale, makes it possible to determine whether
participants consider the majority of items to be familiar (or unfamiliar) and to
examine whether all stimuli received a higher rating in a second rating session,
provided that there is no ceiling effect preventing increased familiarity from being
expressed for certain items. On the other hand, there is a risk that the number of
response options on a Likert scale does not match well with the degrees of
familiarity as perceived by participants. When offered the opportunity to
distinguish more than seven degrees of familiarity, 83.3% of the participants in
our study did so. If a Likert scale is opted for, it would be advisable to carefully
consider the number of response options.
Prior to the start of an experimental task, participants practiced with items that
consisted of words unrelated to the experimental stimuli. For each task, we
randomized the stimuli once and kept the presentation order the same for all
participants. The reason for this is that we were interested in variation across
participants and we wanted to ensure that any differences between participants’
responses were not caused by differences in stimulus order. We examined
whether there were effects of presentation order (such as shorter response times
in the course of the experiment because of familiarization with the procedure, or
longer response times because of fatigue or boredom) by including the factor
presentation order as a predictor in the statistical analyses.
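The fixed randomization described above can be sketched as follows; the seed value and item names are arbitrary.

```python
import random

stimuli = ["item_%02d" % i for i in range(1, 6)]

# Randomize once with a fixed seed, so that every participant sees the
# same pseudo-random order and order effects can be modeled afterwards.
rng = random.Random(20190101)  # the seed itself is arbitrary
order = stimuli[:]
rng.shuffle(order)

# Record each item's presentation position so that it can enter the
# statistical analyses as a predictor:
position = {item: idx + 1 for idx, item in enumerate(order)}
print(sorted(position.values()))  # [1, 2, 3, 4, 5]
```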
Another decision to be made is whether the stimuli are presented in isolation
or embedded in a context. This may affect the generalizability of the results. In
natural language use, linguistic items are encountered in a context and this
context can influence the way in which words are interpreted, processed, and
responded to. In our first study (Chapter 2), we investigated potential effects of
context on familiarity judgments. Participants rated 44 prepositional phrases
which were presented as isolated word strings and embedded in a sentence that
constituted a prototypical context, resembling the contexts in which the phrase
occurred most frequently in the Corpus of Spoken Dutch (CGN). Adding such a
context did not yield significantly different judgments. Whether this also holds for
other kinds of contexts, varying in size and prototypicality, is yet to be
investigated. For possible effects of context in other types of experiments, see
studies like those by Burmester et al. (2014), Camblin et al. (2007), and Griffin
and Bock (1998) and overviews like Kuperberg and Jaeger’s (2016).
Lastly, it is worth considering the insights that can be gained by having
participants perform the same experiment twice. While the merits of combining
different types of experiments (e.g. Arppe et al. 2010) and replicating a particular
study (Andringa & Godfroid 2019; Koole & Lakens 2012) are acknowledged and
promoted these days, there seems to be less attention to the value of multiple
measurements using the same method, stimuli, and participants. Different kinds
of tasks may complement each other (Schönefeld 2011); replications reveal to
what extent findings hold when new participants and/or new stimuli are used
(Schmidt 2009). The added value of conducting an experiment twice with the
same group of subjects is that it leads to a better understanding of the dynamism
of mental representations within one language user. Multiple measurements may
reveal that the picture that emerged from a single measurement is incomplete
and oversimplified.
If participants are to perform an experiment twice, the researcher will have to
decide on the test-retest interval. In our first two studies, in which we examined
intra-individual variation in metalinguistic judgments across time, participants
completed the task twice within a period of one to three weeks. They knew in
advance that the experiment involved two test sessions, but not that they would
be doing the same task twice. We opted for this time frame as it would be short
enough for the construct being tested not to have changed much (i.e. perceived
degree of familiarity of phrases that may be used in everyday life, based on at
least 18 years of linguistic experiences), yet long enough for the participants not
to be able to recall the exact scores they assigned to each of the 88 (Chapter 2)
or 79 (Chapter 3) stimuli. The experimental design allowed us to examine
variation in judgments within participants from one moment to the next, and to
compare this to variation between participants.
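For illustration, the Time1–Time2 consistency of one participant's ratings can be quantified with a simple Pearson correlation; the ratings below are invented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# One participant's familiarity ratings for six items at Time 1 and Time 2:
time1 = [7, 3, 5, 6, 2, 4]
time2 = [6, 3, 5, 7, 1, 4]
print(round(pearson(time1, time2), 2))  # high, but not perfect, consistency
```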
Text box 4. Considerations regarding the selection and design of experimental tasks.
What do you want your experimental data to reveal?
- Overview of judgment tasks: Schütze and Sprouse (2013)
- Overview of experimental paradigms: Blom and Unsworth (2010),
Kaiser (2013)
- Consider the added value of collecting different types of data
(Schönefeld, 2011)
What type of design is most useful for your research questions?
Fully crossed, counterbalanced, stimuli-within-condition, participants-within-condition, both-within-condition (Westfall et al. 2014)
In what order should you present your tasks and stimuli?
Randomized, counterbalanced, kept constant (Myers et al. 2010: 412)
Will you embed the stimuli in a context (see Burmester et al. 2014; Camblin
et al. 2007; Kuperberg & Jaeger 2016; Chapter 2)? If so, consider the
position of the stimulus in the context and the extent to which this may
affect processing (e.g. wrap-up effects in eye tracking).
Will you include breaks during tasks? If so, consider starting with a filler
item right after a break in an online task, just in case the participants are
not yet fully focused upon recommencement.
Will you conduct an experiment twice with the same participants? See
Chapter 2 for the merits of doing this and considerations regarding the test-retest interval.
Draw up a protocol, to ensure that all participants are tested in the same
way, and conduct a pilot study to detect mistakes, bugs, and lack of clarity.
6.2.5 Selecting participants
In the process of selecting participants, the first step is to define the population
you intend to generalize to. After the characteristics of the population have been
specified, participants can be selected in such a way that they constitute a
representative sample that lends itself to generalizations. Note that while the
majority of publications in the behavioral sciences are based on data from Western
undergraduates, this subpopulation is among the least representative groups of
participants for generalizing about human behavior in general (Henrich, Heine &
Norenzayan 2010). Recruiting other types of participants requires some more
creativity. If you are looking for a sample of Dutch adults and data collection can
take place outside a lab, you could think of visiting the Driver and Vehicle Licensing
Agency –an agency that is visited by people from all strata of society– and inviting
the visitors who are waiting there for someone taking a driving test to take part
in your research. This sample of participants is a more faithful reflection of Dutch
society than a group of undergraduates.
Subsequently, consideration should be given to the information that is required
to adequately categorize and characterize participants, and to account for their
performance in experiments. The challenge is to strike a balance: on the one hand,
researchers should not purposelessly collect all kinds of background information;
on the other hand, they should not simply assume that particular people will, or
will not, differ from each other in certain respects. For example, kindergartners in
Wassenaar –an affluent suburb of The Hague, populated by a large expatriate
community– need not all be monolingual speakers of Dutch. This should be
verified, for example by means of a questionnaire. By gathering information
regarding relevant variables, it is possible to determine whether the participants
meet the requirements (e.g. monolingual Dutch) and to find out whether they
differ from each other on confounding variables (e.g. working memory capacity;
see text box 5 for examples of other variables and methods to assess them). If
the latter proves to be the case, these data can be included in the analyses as
covariates. Alternatively, matching can be used to ensure that the groups to be
compared no longer differ in those respects. In pairwise matching, each
participant in the first group has a match in the second group in terms of working
memory capacity. In groupwise matching, participants are included in the second
group if their working memory capacity falls within the first group’s WMC-range.
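Groupwise matching can be sketched in a few lines; the participant labels and WMC scores below are invented.

```python
# Groupwise matching on a confounding variable (here: working memory
# capacity, WMC). A participant from the second group is only kept if
# their score falls within the first group's range.
group1_wmc = [3.2, 4.1, 5.0, 4.6]
group2 = {"p01": 3.5, "p02": 5.8, "p03": 4.0, "p04": 2.9}

lower, upper = min(group1_wmc), max(group1_wmc)
matched = {p: wmc for p, wmc in group2.items() if lower <= wmc <= upper}
print(sorted(matched))  # ['p01', 'p03']
```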
Another decision that calls for deliberation is the number of participants to be
included. If the analyses involve statistical tests, too small a sample size can make
the study underpowered. This may give rise to several problems, such as a
reduced chance of finding true effects. It is therefore highly recommended to
conduct a power analysis and determine the appropriate sample size (see
Westfall et al. 2014). If there is a risk of participants dropping out (e.g. in studies
comprising multiple test sessions) or data loss (e.g. in eye tracking research),
more data can be collected as a measure of precaution.
In our last study (Chapters 4 and 5), we were looking for participants who
belonged to one of three groups: recruiters, job-seekers, and people not (yet)
looking for a job. We chose these groups as they were expected to differ in
experience in the domain of job hunting. We strived to make sure that they did not
differ systematically in other respects, such as mother tongue and education level.
We contacted recruitment agencies, as well as HR managers at universities
and colleges in Noord-Brabant (a province in the south of the Netherlands), and
we conducted our study with 40 people with professional experience with job ads.
To recruit job-seekers, we got in touch with the Dutch employee insurance agency
(UWV). They sent out a message to approximately 1200 people who were
registered as having completed higher vocational or university education,
informing them about the opportunity to take part in our study voluntarily. Forty-seven of
them were able to participate at that time. To find people who did not (yet) have
any experience with job ads, we invited the first-year bachelor and premaster
students of Communication and Information Sciences at Tilburg University who
were native speakers of Dutch to participate for course credit. Seventy-two students
completed the battery of tasks.
As part of the experimental tasks, participants filled out a questionnaire
regarding demographic variables. After the last experiment, participants were
presented with three questions about their experience with job ad texts. We asked
them how many job ads they had read in the past three months (encompassing
both thorough reading and scanning); how many months there had been in the
past three years in which they read at least 25 job ads per month; whether they
ever had a job in which they regularly read or wrote job ads, and if so, for how
long. Forty-two students qualified as inexperienced participants, as they reported having
read either no job ads in the past three months, or a few but less than one per
week. Furthermore, in the past three years there was not a single month in which
they had read 25 job ads or more, and they never had a job in which they had to
read and/or write such ads. As for the job-seekers, we selected those who
reported having read at least three to five job advertisements per week in the
three months prior to the experiment, and who never had a job in which they had
to read and/or write such ads. This left us with 40 job-seekers.
We made sure that all participants were native speakers of Dutch who had
spent most of their youth in the Netherlands, and who had completed higher
vocational or university education or were in the process of doing so. The groups
could not be matched in terms of age. It was not feasible to find sufficient people
who were of the same age as the recruiters and the job-seekers, yet did not have
any experience with job ads, or highly-educated recruiters and job-seekers who
were of the same age as the inexperienced participants. The difference in age may
play a role in the voice onset time experiment, since older adults have slower
reaction times. This is not an insurmountable issue, as systematic differences in
reaction times across participants can be accounted for in the statistical
analyses of the experimental data (for example, by means of a by-subject random
intercept in mixed-effects models).
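What a by-subject random intercept captures can be illustrated with a toy decomposition. The reaction times below are invented, and subtracting means by hand is only a didactic stand-in: a real analysis would fit the model with dedicated software such as lme4 in R or statsmodels' MixedLM.

```python
from statistics import mean

# Invented reaction times (ms): older participants are slower overall,
# but the frequency effect is the same for everyone.
rts = {
    "older_1":   {"high_freq": [620, 610], "low_freq": [660, 650]},
    "older_2":   {"high_freq": [640, 630], "low_freq": [680, 670]},
    "student_1": {"high_freq": [480, 470], "low_freq": [520, 510]},
}

all_rts = [rt for subj in rts.values() for cond in subj.values() for rt in cond]
grand_mean = mean(all_rts)

# The by-subject intercept is each participant's deviation from the grand mean.
intercepts = {s: mean([rt for cond in d.values() for rt in cond]) - grand_mean
              for s, d in rts.items()}

# With each participant's baseline removed, the condition effect emerges
# undistorted by between-subject differences in overall speed.
def baseline_removed(condition):
    return [rt - grand_mean - intercepts[s]
            for s, d in rts.items() for rt in d[condition]]

frequency_effect = (mean(baseline_removed("low_freq"))
                    - mean(baseline_removed("high_freq")))
print(round(frequency_effect))
```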
Text box 5. Considerations regarding the selection of participants.
What is/are the population(s)?
See e.g. Unsworth and Blom (2010) on comparing L1 and L2 children
and adults; Paradis (2010) on comparing typically-developing children
and children with specific language impairment.
What sampling technique should you use?
(E.g. convenience, random, stratified, snowball [Buchstaller & Khattab
2013])
How, and how much, do participants differ on relevant variables? These
data can be used in pairwise or groupwise matching (Paradis 2010), or they
can be included as co-varying factors in the analyses. Examples of
potentially relevant variables and methods to assess them are listed
successively. Only collect those data for which you have a sound theoretical
basis to suspect that they play a role.
Demographic: gender, age, ethnicity, regional background, educational
background, occupation, socioeconomic status
Linguistic: mother tongue(s), other languages spoken, age of arrival in an
L2 environment, learner motivation, amount of exposure to a particular
language, language proficiency, vocabulary size
Cognitive: working memory capacity, nonverbal intelligence, statistical
learning ability, need for cognition
Methods to assess such variables include:
- Questionnaire (e.g. Gutiérrez-Clellen & Kreiter 2003; Tanner, Inoue &
Osterhout 2014)
- Author Recognition Test (Stanovich & West 1989) to assess reading
experience
- Language proficiency test (e.g. Hulstijn 2010)
- Peabody Picture Vocabulary Test (Dunn & Dunn 1997)
- SILS Vocabulary Subtest (Zachary 1994)
- Verbal working memory span, assessed for example in Daneman and
Carpenter’s (1980) or Waters and Caplan’s (1996) reading span task
- Phonological short-term memory, assessed for example in a nonword
repetition task (Gupta 2003)
- Short-term memory span, assessed for example in an auditory Forward
Digit Span task (Wechsler 1981)
- See Olsthoorn, Andringa and Hulstijn (2014) on working memory
capacity tests for native and nonnative speakers
- Raven’s Advanced Progressive Matrices test (Raven, Raven & Court 1998)
- The Culture Fair Intelligence Test (Cattell 1971)
- (Non)adjacent-dependencies artificial grammar learning (Gómez 2002;
Misyak & Christiansen 2012)
- Need for Cognition scale (Cacioppo, Petty & Kao 1984)
What is the appropriate sample size (Westfall et al. 2014)? Take into
account chances of drop-out and data loss.
Research ethics
Assess potential risks and benefits to your participants and ways to
guarantee confidentiality and anonymity.
Make sure you comply with the current Code of Conduct and
regulations of your research institute. Check whether your institute
and/or the venue for publishing your work require a research ethics
committee’s approval.
Acquire informed consent from participants, and debrief them once
they have completed the tasks (see Eckert 2013, and see Treadwell
2017 Chapter 3 in the case of conducting research on the Internet).
6.2.6 Concluding remarks
The preceding sections discussed methodological and practical considerations in
the selection of corpus data, metrics to analyze corpus data, stimuli, experimental
tasks, and participants. While these steps cover a significant part of the research
process, they are by no means all-encompassing. A well-designed study also
entails careful consideration of data management and transparency with respect
to the goals and the decisions that are made (see text box 6). Furthermore, multi-method research yields multifaceted datasets that require statistical analyses
that do justice to the nature of the data. Which kinds of tests are most suitable
depends on the types of data (e.g. continuous or categorical; the number of levels
of a categorical variable; whether a variable is nested in other variables, as in the
case of children grouped in classes, which are in turn nested in schools) and the
research questions (see, for example, Chapters 14 to 16 in Podesva & Sharma
2013; see Chapters 2 to 5 for the analyses employed in our studies).
Multi-method research is often more challenging than mono-method research
as regards the operationalization of constructs across methods and data
analysis. There is great merit in taking up these challenges, as it leads to more
robust evidence, a more complete picture of the subject of research, and a better
understanding of the characteristics and limitations of different methods. Various
examples of studies in linguistics that successfully combine different types of
data can be found in Schönefeld (2011) and Arppe et al. (2011). What our studies
(Chapters 2 to 5) add to that, is that they showcase the added value of conducting
multiple experiments with the same participants and having participants perform
the same task twice. The present chapter discussed these possibilities as part of
considerations in the design of multi-method studies. Hopefully, the coming years
will see a rise in multi-method studies and constructive debates about the
relationships between different types of data and the cognitive representations
and processes they tap into.
Text box 6. Considerations regarding research project management.
Preregistration and registered reports
Preregistration entails that you register your research questions and
analysis plan prior to data collection (https://cos.io/prereg/). If you opt
for a registered report, your research proposal is reviewed prior to data
collection (https://cos.io/rr/). If accepted, your findings will be published
irrespective of the outcome, provided that you followed the registered
plan or provide justification for deviating from it. Preregistration and
registered reports help to get nonsignificant findings published and they
foster fair research practices and replications (Nosek & Lakens 2013).
Data management
How will the data be stored, for how long, and who will have access to
which parts? Consider repositories like https://dataverse.nl/,
https://www.surfdrive.nl/.
Chapter 7 Discussion
The studies presented in this dissertation aim to contribute to theories about the
mental representation of language, by examining variation in entrenchment of
multi-word units. Cognitive Linguistics takes a usage-based perspective, meaning
that mental representations of language are taken to emerge from, and are
continuously shaped by, language use. The more frequently a speaker encounters
and uses a particular linguistic structure, the more the mental representation of
this structure becomes entrenched. As a result, it can be activated and processed
more quickly, which, in turn, increases the probability that this form is used to
express the given message, making this construction even more entrenched.
Conversely, extended periods of disuse weaken the representation (Langacker
1987: 59). Thus, in a usage-based approach, linguistic representations are
inherently dynamic: they change over time, and they may differ from one person
to another, depending on differences in usage.
While variation naturally follows from a usage-based perspective, surprisingly
little is known about how variable mental representations of language are.
Corpora, which take a prominent position in usage-based research, contain usage
data, but they are nearly always an amalgamation of data from many different
people. They may yield insight into the degree of conventionalization of a linguistic
construction in a community, but they cannot directly reveal degrees of
entrenchment of mental representations (Schmid 2010). As for experimental
data, most researchers analyse and report these data at the level of aggregated
scores, thus masking individual differences. Often, they relate such data to scores
like cloze probabilities and corpus-based measures, which are based on
amalgamated data from yet other people. The merit of such an approach is that
it has demonstrated robust correlations between frequency of occurrence of
linguistic constructions and behavioral indices of cognitive routinization. The
drawback is that it disregards inter- and intra-individual variation, while insight into
variation is a prerequisite for a veridical description of mental representations of
language. In studies of child language acquisition, children’s productions have
been shown to be closely linked to their own prior experiences (e.g. Borensztajn,
Zuidema & Bod 2009; Dąbrowska & Lieven 2005; Lieven, Salomo & Tomasello
2009). In adult native speakers, by contrast, individual differences in
representation and processing of language have received much less attention,
even though such differences are to be expected on theoretical grounds.
Recent years have seen a growth of interest in social and behavioral sciences
in the analysis of individual differences and the limitations of aggregated data as
representative of individuals’ knowledge and behavior (e.g. Isakov et al. 2016;
Nurius & Macy 2008; Seto et al. 2016; Vindras et al. 2012; von Eye & Bogat 2006;
von Eye et al. 2006). In the field of cognitive linguistics, too, this topic has been given a
more prominent position (e.g. Andringa & Dąbrowska 2019; Barking et al.
submitted; Barlow 2013; Dąbrowska 2018; Zimmerer et al. 2011). My dissertation
contributes to this strand of research. The studies reported here examined
variation between and within participants in metalinguistic judgments on and
processing of multi-word sequences. They investigated the variation to be
detected and the extent to which this variation can be considered meaningful. In
the present chapter, I summarize the main findings and consider the theoretical
implications.
7.1 Summary of the findings and their implications
Before analyzing variation between and within individual participants, checks were
performed to ascertain that the data display usage-based variation on a more
coarse-grained level. The comparison of data from three groups of participants –
recruiters, job-seekers, and people not (yet) looking for a job– constituted a first
test of usage-based principles. This comparison yielded clear evidence for the
relationship between amount of experience with a particular register and (i) the
expectations people generate about upcoming words given the initial part of a
word string characteristic of that register (Chapter 4); (ii) the speed with which
they process such word strings (Chapter 4); and (iii) how familiar they consider
these word strings to be (Chapter 5). The results indicate that there are
systematic differences in participants’ knowledge and processing of multi-word
units which are related to their degree of experience with these word sequences.
This forms empirical support for a hypothesis that follows from usage-based
theories of linguistic knowledge and language processing. As the three groups
differ in experience in the domain of job hunting, participants’ experiences with
these collocations resemble their fellow group members’ experiences more than
those of the other groups. Consequently, on the job ad stimuli, the variation across
groups in expectations, reaction times, and familiarity judgments is hypothesized
to be larger than the variation within groups. This hypothesis was supported by
the data.
Another finding that was to be expected from a usage-based perspective, is
that corpus-based word and phrase frequency correlated with familiarity ratings
and reaction times. Higher-frequency phrases were assigned higher familiarity
ratings (Chapter 5 as well as Chapters 2 and 3), and when the target word had
not been mentioned in the completion task, higher-frequency words elicited faster
responses than lower-frequency words (Chapter 4).
The next step involved an investigation of individual differences. It is not just
groups of speakers that differ systematically and meaningfully; also at the level
of individual speakers, there are meaningful differences to be detected. No two
speakers are identical in their linguistic experiences. Usage-based theories thus
predict individual differences in entrenchment of linguistic constructions. Indeed,
there turned out to be significant relationships between data elicited from an
individual participant in different types of psycholinguistic tasks using the same
stimuli. A measure of a participant’s own predictions (recorded in the completion
task) was a significant predictor of that participant’s processing speed
(measured in the voice onset time experiment). Furthermore, individual
participants’ data from the completion task and the VOT task were significant
predictors of the familiarity ratings they assigned to the stimuli. What is more,
these participant-based measures were significant predictors on top of measures
based on amalgamated data of different people (i.e. corpus-based frequencies
and surprisal; cloze probabilities). In other words, participant-based measures
proved to have unique additional explanatory power. This demonstrates the
existence of systematic, measurable inter-individual variation in behavioral indices
of cognitive routinization.
In addition, there is evidence of intra-individual variation which, too, points to
the dynamic character of mental representations of language. A test-retest design
provided insight into this kind of variation and illustrated the added value of multiple
measurements. In the studies reported in Chapters 2 and 3, participants
performed a familiarity judgment task twice within a couple of weeks. The ratings
correlated significantly with corpus-based frequencies, just like familiarity ratings
in Chapter 5 did. Moreover, analyzing the data at the level of aggregate ratings
revealed a near-perfect Time1–Time2 correlation. While these findings are
interesting, additional insights are obtained by zooming in. None of the
participants were as stable in their ratings as the aggregated ratings are; and no
single item elicited stable ratings from all participants. The intra-individual
variability of metalinguistic judgments could be interpreted as a lack of precision
in expressing degree of familiarity. However, it is worth considering an alternative
interpretation, namely that the variability in judgments reflects the variability of
mental representations of language – at least of multi-word units; whether this
may hold for other types of constructions as well is discussed in Section 7.2. Most
psycholinguistic tasks that try to tap into the degree of entrenchment of a
linguistic unit in the mind of a speaker, express this in a single value (e.g. a rating,
a reaction time). However, if cognitive representations can best be viewed as
more, or less, densely populated clouds of exemplars that vary in strength
depending on frequency and recency of use, a single score yields an incomplete
picture. Therefore, I not only advocate attending to variation across participants,
I also urge cognitive linguists to carry out multiple measurements per participant.
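The contrast between stable aggregate ratings and variable individual ratings can be reproduced in a small simulation. All parameters below (number of items and raters, the rating noise) are invented; the point is only that averaging over forty raters washes out the rating-to-rating fluctuation that is plainly visible at the individual level.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(1)
n_items, n_subjects = 30, 40
true_values = [random.uniform(1, 7) for _ in range(n_items)]  # items' "true" familiarity

def rate(value):
    # a single rating fluctuates around the item's value on a 1-7 scale
    return min(7.0, max(1.0, value + random.gauss(0, 1.2)))

time1 = [[rate(v) for v in true_values] for _ in range(n_subjects)]
time2 = [[rate(v) for v in true_values] for _ in range(n_subjects)]

# Aggregate ratings: per-item means across participants.
agg1 = [mean(s[i] for s in time1) for i in range(n_items)]
agg2 = [mean(s[i] for s in time2) for i in range(n_items)]

r_aggregate = pearson(agg1, agg2)
r_individual = mean(pearson(time1[s], time2[s]) for s in range(n_subjects))
print(round(r_aggregate, 2), round(r_individual, 2))
```

In this toy setup the aggregate Time1–Time2 correlation approaches 1 even though no single simulated participant is anywhere near that stable.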
In sum, the studies yielded support for hypotheses that follow from a usage-based approach, and the insights that were gained upon exploring inter- and intra-individual variation tie in well with such an approach. The findings are an
encouragement to flesh out and refine the usage-based framework by studying
variation. In what follows, I sketch three compelling directions for future research
that build on the work presented in this dissertation. I propose to further develop
participant-based measures, to follow participants in the course of a few weeks
or months, and to examine (partially) schematic constructions in addition to
lexically specific ones. These developments will advance our understanding of the
relations between patterns in aggregated data and individual speakers’ mental
representations.
7.2 Suggestions for future research
The findings presented in this dissertation invite us to further develop theories on
the relationship between language in the community and language in the mind,
and to formulate and test hypotheses on this matter. Various researchers have
drawn attention to the distinction between community-level phenomena and
cognitive phenomena in individual speakers (e.g. Backus 2015; Dąbrowska 2016b;
Schmid 2015). It follows that patterns observed in aggregated data cannot simply
be assumed to be represented as such in the minds of all speakers (e.g.
Dąbrowska 2008, 2010; Schmid 2010; Schmid & Mantlik 2015; Zimmerer et al.
2011). For one thing, linguistic constructions are not used by all speakers to the
same extent. Furthermore, the mental representations which are activated while
processing a particular utterance may differ from one speaker to another, and
within one speaker from one moment to another. Usage-based models of
language would benefit from a better understanding of individual differences and
the relationships between patterns in aggregated data and individual speakers’
mental representations. To this end, more insight is required into the ways in
which, and the extent to which, individuals differ from each other.
My studies contributed to this by examining variation in metalinguistic
knowledge and processing of multi-word units, and contrasting group-based
measures and participant-based measures as predictors of judgment data and
reaction times. Aggregated data, if sufficiently representative, proved to be
significant predictors of participants’ familiarity ratings and voice onset times.
More specifically, cloze probabilities –a measure of word predictability based on
the completion task data of all 122 participants together– significantly predicted
voice onset times (Chapter 4), and corpus-based phrase frequency was a
significant predictor of familiarity judgments (Chapters 2, 3, and 5). These
relationships are to be expected given the large body of work showing correlations
between corpus frequencies and data from psycholinguistic experiments, yet
more work is needed to fully understand the nature of the relationships between
frequency of occurrence based on aggregated data and individual participants’
performance on psycholinguistic tasks.
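The contrast between a group-based and a participant-based measure can be made concrete with a toy example. The stimulus fragment, the responses, and the first-response operationalization of cloze probability are all invented for illustration; the actual studies used their own operationalizations.

```python
# Toy contrast: cloze probability (group-based) vs. TargetMentioned
# (participant-based). Stimulus and responses are invented.

# Completions four hypothetical participants produced for a job-ad
# fragment like "Wij bieden een marktconform ..." ('We offer a competitive ...').
responses = {
    "p1": ["salaris", "loon"],
    "p2": ["salaris"],
    "p3": ["aanbod", "salaris"],
    "p4": ["loon"],
}
target = "salaris"

# Cloze probability (one common operationalization): the proportion of
# participants whose *first* completion is the target word.
first_responses = [r[0] for r in responses.values()]
cloze = first_responses.count(target) / len(first_responses)

# TargetMentioned: per participant, did the target occur anywhere in
# that participant's own completions?
target_mentioned = {p: target in r for p, r in responses.items()}

print(cloze)              # one value for the whole group
print(target_mentioned)   # one value per participant
```

The group receives a single cloze score, while TargetMentioned differs across participants, which is what allows it to explain variance that the group-based score cannot.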
To serve as a predictor of speakers’ processing speed and perceived degree
of familiarity, aggregated data have to be representative of the participants’
experiences with the word sequences at hand. This may seem obvious, yet it is
far from easy to specify what exactly qualifies as representative. The content of
the corpus is of crucial importance, even more so than corpus size (see, for
example, Blom et al. 2012). The better the content reflects the linguistic
experiences of the participants, the better the corpus-based measures predict
their performance in experiments, as has been shown in studies that assessed
how well different types of corpus data predict performance of a given group of
participants (e.g. Blom et al. 2012) or how well one type of corpus data predicts
performance of different groups of participants (e.g. Gardner et al. 1987). It may
be beneficial in this respect to work with more specific definitions of speech
communities. This will allow for a more precise analysis of effects of group
membership, and more insights into the extent to which entrenchment is
determined by usage frequency and by other factors (such as cognitive abilities).
For a researcher interested in the use of particular constructions, it would be good
to ascertain whether these constructions are characteristic of particular speech
communities, define the groups in question, and specify ways to determine to
what extent a speaker can be considered a member of a certain community. In
Chapters 4 and 5, I started with a priori constructed groups. On the basis of a set
of criteria regarding work experience in the field of HR and reported number of job
ads read within particular time frames, participants were selected and classified
as belonging to one of three groups (i.e. recruiters, job-seekers, and people not
(yet) looking for a job). Comparisons of psycholinguistic data revealed that the
three groups differ systematically in participants’ knowledge and processing of
multi-word units characteristic of job ads, while they do not differ significantly on
word sequences characteristic of news reports. Subsequently, I analyzed
individual variation within groups. Instead of defining groups a priori, people can
be classified in a data-driven manner, by identifying user profiles in the data on
reported experience (for example, by means of cluster analysis, configural
frequency analyses, latent class analysis, or latent class mixture models; von Eye
et al. 2004, 2006). This may yield different demarcation lines and result in more
(or less) fine-grained groupings than a priori classifications. Subsequently,
analyses of the psycholinguistic data can reveal to what extent aggregate level
statements hold for the different subgroups and for individuals.
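As a minimal stand-in for such data-driven classification (the dedicated techniques mentioned above require specialized software), a one-dimensional k-means over reported exposure illustrates the idea. The self-report figures are invented.

```python
from statistics import mean

def kmeans_1d(values, k, iters=50):
    """Minimal one-dimensional k-means; returns a cluster label per value."""
    # initialize centers with roughly evenly spaced sorted values
    centers = sorted(values)[::max(1, len(values) // k)][:k]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centers[c]))
                  for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # keep the old center if a cluster ends up empty
            new_centers.append(mean(members) if members else centers[c])
        centers = new_centers
    return labels

# Invented self-reports: job ads read per month for ten participants.
ads_per_month = [0, 1, 0, 2, 20, 25, 18, 90, 110, 100]
print(kmeans_1d(ads_per_month, 3))
```

Here the algorithm recovers three exposure profiles from the reports alone, without any a priori group labels; whether such data-driven clusters coincide with a priori groups is precisely the empirical question.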
My studies indicated that there are significant relationships between
aggregated data and individual speakers’ judgment data and reaction times, while
at the same time aggregated data mask meaningful variation and do not suffice
if the goal is to describe mental representations of language. The variable
TargetMentioned, based on data elicited from an individual participant, had an
effect on voice onset times over and above the effects of target word corpus
frequency and cloze probability (Chapter 4). Participants were significantly faster
to name the target if they had mentioned it themselves in the completion task.
Similarly, in Chapter 5, the participant-based variables TargetMentioned and VOT
were significant predictors of familiarity judgments when corpus-based phrase
frequency had already been added to the statistical model. These findings
demonstrate inter-individual variation which is stable across different types of
experimental tasks. In addition, Chapters 2 and 3 provided insight into intra-individual variation in metalinguistic judgments. Both inter- and intra-individual
variation are characteristic of mental representations of language. Group-based
measures –such as corpus frequencies and cloze probabilities based on
completion task data from different people– are insufficient if the goal is to gain
a better understanding of the dynamic nature of linguistic representations.
To this end, additional measures are to be developed and compared, and I hope
that my studies can serve as an example and incite researchers to build on this.
To obtain data on individual participants’ context-sensitive expectations, I
conducted the completion task (Chapter 4). Participants listed all appropriate
complements that came to mind within five seconds. The responses yield insight
into predictions, reflecting what is top of mind for a participant at a given point in
time. While surprisal estimates based on corpus data allow for gradual differences
across complements in predictability, the set of scores is static. Participants’
responses to a completion task, by contrast, may vary from one person to another,
and from one moment to another. As such, they are better able to do justice to
the variability in associations and ease of activation.
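For comparison, corpus-based surprisal reduces the predictability of each continuation to a single static score. The sketch below shows the computation; the bigram continuation counts are invented for illustration.

```python
import math

# Surprisal of a continuation given its context: -log2 P(word | context).
# The continuation counts after a hypothetical context word are invented.
continuation_counts = {"salaris": 80, "loon": 15, "aanbod": 5}
total = sum(continuation_counts.values())

def surprisal(word):
    """Corpus-based surprisal in bits."""
    return -math.log2(continuation_counts[word] / total)

# A frequent continuation carries little surprisal; a rare one carries much.
print(round(surprisal("salaris"), 2))
print(round(surprisal("aanbod"), 2))
```

These scores are fixed for all speakers at all times, which is exactly the limitation that participant-based completion data are meant to address.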
To be able to use the completion task responses as a predictor of processing
speed and perceived degree of familiarity, the data were converted into a score
that indicates for each participant individually whether the target word had been
mentioned or not. This variable, TargetMentioned, proved to be a valuable
measure, being a significant predictor of voice onset times in a subsequent
naming task, as well as familiarity judgments. However, as a binary variable, it
does not account for gradient differences in the degree to which words are
expected to occur. It is worth exploring whether the order in which complements
are listed by a participant, the number of complements, and perhaps the time it
took the participant to come up with a complement provide useful information in
this respect. Furthermore, additional tasks can be used to address the fact that
certain complements which are not listed by a participant may be familiar to that
person nonetheless. It would be interesting to have participants perform the
completion task twice and examine the variability in answers from one moment
to another. For lower-frequency items, such variation is likely to be larger and
responses may be more strongly influenced by recent experiences (e.g. the
complements participants listed for the stimulus een internationale speler van ‘an
international player of/from’ were often related to the soccer match broadcast
the night before). In addition, stimuli could be provided with more linguistic
context, which may affect the saliency and ease of activation of particular
complements. These are just a few suggestions as to how the potential of
participant-based data can be explored. My studies form a first step, illustrating
that it is possible to construct participant-based measures and worthwhile to do
so, as they offer new opportunities to gain insight into inter- and intra-individual
variation.
Apart from developing new measures, the outcomes of my studies also form
an invitation to extend the research questions and experimental designs to other
kinds of linguistic constructions and developments over time. The approach I
adopted proved to be an effective way to test hypotheses that follow from a
usage-based approach. That is, a comparison of data from groups of speakers to
reveal usage-based variation was shown to be a fruitful approach in Chapters 4
and 5. The practice of conducting multiple tasks among the same participants
yields more insight into variation across speakers on the one hand, and variation
across different measures that aim to tap into degree of entrenchment on the
other. Conducting the same task twice with the same participants, as in Chapters
2 and 3, allows for a better understanding of the relative degrees of inter- and intra-individual variation. This dissertation serves as a proof of concept; the approach
can now be applied more extensively and hypotheses can be formulated and
tested on a more fine-grained level.
For one thing, it would be interesting to follow participants in the course of a
few weeks or months, extending the test-retest design. This can provide additional
insights into the effects of usage frequency on processing speed and perceived
degree of familiarity. It is clear by now that frequency is a key factor. What is not
so clear, is to what extent recency of use matters; whether it makes a difference
whether you used a linguistic item once or twice that day; and whether this works
differently for low-frequency items compared to high-frequency ones. Such
questions can be addressed by tracking people who are in various stages of
acquiring a particular jargon (e.g. job ad jargon as a recruiter, political jargon and
officialese as a city councilor, statistics jargon as an undergraduate) or language
(be it an artificial language like Klingon, or a natural language – a promising
project, in this respect, is Marie Barking’s, which examines developments in usage
and processing of transferred constructions by German students acquiring
Dutch). Suppose a study involves participants who just started to work as a
recruiter and participants who have had this occupation for various amounts of
time. It may be possible to keep track of the texts they read and write during
working hours in the course of the investigation. This allows for personal corpora
that provide information on individual participants’ experiences with a given
linguistic construction over time. Experimental data (e.g. on expectations,
processing speed, phonological reduction, perceived degree of familiarity)
collected in a repeated measures design can be analyzed to identify
developmental pathways over time (Nurius & Macy 2008; von Eye et al. 2004)
and examine the relationships with developments in usage frequency. This will
yield insight into the process of entrenchment, rather than just the product –
something usage-based theories will benefit from.
Another direction for future research, which I have not yet explored, is to
examine other kinds of linguistic constructions. This dissertation focused on
multi-word units – a type of construction that lends itself well to the investigation
of usage-based variation. The next step is to examine whether similar degrees of
variation can be observed for (partially) schematic constructions. From a usage-based perspective, multi-word units are not essentially different from morphemes,
words, partially schematic constructions, or fully abstract schemas. They vary in
size and specificity, but they are in essence all form-meaning pairings whose
linguistic representation emerges from experience with language together with
general cognitive skills such as categorization, schematization, and chunking
(Barlow & Kemmer 2000; Bybee 2010). However, it is as yet an open question
whether (partially) schematic constructions will display similar degrees of
variation as lexically specific constructions.
On the one hand, mental representations of (partially) schematic
constructions, too, are dynamic in nature, as they emerge from linguistic
experiences. On the other hand, schematic constructions tend to have a more
general meaning, a wider range of usage contexts, and a higher frequency of
occurrence than lexically specific constructions, which may result in less inter- and intra-individual variability. While the specific instances people encounter and
use may differ, the commonalities in meaning and structure could enable them to
arrive at similar abstract representations, and differences in the token frequency
of the schematic construction may be relatively small. Even when there are
individual differences in amount of experience with a schematic construction, the
usage frequencies may be so high across the board that there are no detectable
differences in processing speed across participants (i.e. a ceiling effect, see for
example Street & Dąbrowska 2014; Wells, Christiansen, Race, Acheson &
MacDonald 2009).
Although there are reasons to expect schematic constructions to display
smaller degrees of individual variation, there are reports of variation among adult
native speakers in knowledge of a range of (partially) schematic constructions
(e.g. postmodifying PP, object cleft, object relative, simple locatives with the
quantifier ‘every’, possessive locatives with the quantifier ‘every’, Polish dative
inflections) and these individual differences seem to be associated with
differences in linguistic experience (Dąbrowska 2008, 2018; Wells et al. 2009).
These findings call for more research into usage-based variation in mental
representations of (partially) schematic constructions, to examine whether it is
similar to what is shown in this dissertation with respect to multi-word units.
Importantly, native speakers do not just differ in the language they encounter
and produce, they also differ in cognitive abilities, such as language analytic
ability, statistical learning ability, fluid intelligence, and cognitive motivation
(Dąbrowska 2018; Misyak & Christiansen 2012). Both linguistic experiences and
cognitive abilities appear to influence the process of schematization and
speakers’ knowledge of grammatical constructions. There are indications that
this does not hold for collocational knowledge in the same manner. Dąbrowska
(2018) found participants’ knowledge of collocations to depend primarily on
experience-related factors (print exposure and years spent in full-time education);
it did not depend on language analytic ability and non-verbal IQ, while performance
on grammatical comprehension and receptive vocabulary tasks did. Such findings
do not conflict with usage-based models of linguistic knowledge, but they do call
for a refinement of the theoretical framework regarding the ways in which mental
representations of language emerge and develop. While representations of words,
multi-word units, and grammatical patterns can still be construed as
constructions that emerge from linguistic experience together with general
cognitive skills, they may differ in the extent to which they rely on various cognitive
and experiential factors. Research that aims to advance our understanding of the
contributions of these factors must pay attention to individual differences. It
seems plausible that highly educated speakers, because of their cognitive abilities
as well as their social backgrounds, tend to receive input that makes schematic
constructions more salient to them. Furthermore, the tasks they face at school
and at work likely invite them to use these schematic constructions more
frequently than less-educated speakers. In order to gain more insight into effects
of amount and type of experience on the one hand, and cognitive abilities on the
other, future studies could make use of (partially) schematic constructions which
are characteristic of particular registers, and participants who vary in experience
with these registers. Dąbrowska (2018) used print exposure (measured by means
of an author recognition test) and years spent in full-time education as variables
that reflect linguistic experience. These are rather coarse-grained measures of the
amount and type of experience with passives, postmodifying PPs, and universal
quantifier constructions. In future studies, it may be possible to identify
constructions which are characteristic of specific registers, and obtain more fine-grained information on participants’ degrees of experience with these registers.
Experimental data on knowledge and processing of these constructions can then
be analyzed in relation to amount of experience and cognitive abilities. I hope this
dissertation contributes to this research agenda by demonstrating that it is
feasible and valuable to attend to inter- and intra-individual variation and by
sparking linguists’ enthusiasm for such an approach.
References
Agirre, E. & Edmonds, P. (2007). Word sense disambiguation: Algorithms and
applications. Retrieved from https://link.springer.com/book/10.1007%2F978-1-4020-4809-8
Alegre, M. & Gordon, P. (1999). Frequency effects and the representational status
of regular inflections. Journal of Memory and Language, 40, 41–61.
Altmann, E. G., Pierrehumbert, J. B. & Motter, A. E. (2011). Niche as a determinant
of word fate in online groups. PLOS ONE, 6(5), e19009.
Andringa, S. & Dąbrowska, E. (2019). Individual differences in first and second
language ultimate attainment and their causes. Language Learning, 69(S1), 5-12. doi:10.1111/lang.12328
Andringa, S. & Godfroid, A. (2019). Call for participation. Language Learning, 69,
5-10. doi:10.1111/lang.12338
Arnon, I. & Clark, E. V. (2011). Why brush your teeth is better than teeth: Children’s
word production is facilitated in familiar sentence-frames. Language Learning
and Development, 7, 107–129. doi:10.1080/15475441.2010.505489
Arnon, I. & Cohen Priva, U. (2013). More than words: the effect of multi-word
frequency and constituency on phonetic duration. Language and Speech, 56(3),
349-371.
Arnon, I. & Snider, N. (2010). More than words: Frequency effects for multi-word
phrases. Journal of Memory and Language, 62, 67-82.
Arppe, A., Gilquin, G., Glynn, D., Hilpert, M. & Zeschel, A. (2010). Cognitive corpus
linguistics: five points of debate on current theory and methodology. Corpora,
5(1), 1–27.
Ashton, R. H. (2000). A review and analysis of research on the test–retest
reliability of professional judgment. Journal of Behavioral Decision Making,
13(3), 277–294.
Baayen, R. H. (2001). Word frequency distributions. Dordrecht: Kluwer.
Baayen, R.H. (2010). Demythologizing the word frequency effect: A discriminative
learning perspective. The Mental Lexicon, 5(3), 436-461.
Baayen, R. H., Davidson, D. J. & Bates, D. (2008). Mixed-effects modeling with
crossed random effects for subjects and items. Journal of Memory and
Language, 59, 390-412.
Baayen, R. H. & Milin, P. (2010). Analyzing reaction times. International Journal of
Psychological Research, 3, 12-28.
Baayen, R. H., Milin, P., Filipovic Durdevic, D., Hendrix, P. & Marelli, M. (2011). An
amorphous model for morphological processing in visual comprehension based
on naive discriminative learning. Psychological Review, 118, 438–482.
Baayen, R. H., Tomaschek, F., Gahl, S. & Ramscar, M. (2017). The Ecclesiastes
principle in language change. In M. Hundt, S. Mollin & S. E. Pfenninger (Eds.),
The changing English language: Psycholinguistic perspectives (pp. 21–48).
Cambridge: Cambridge University Press.
Backus, A. (2013). A usage-based approach to borrowability. In E. Zenner & G.
Kristiansen (Eds.), New perspectives on lexical borrowing (pp. 19–39). Berlin:
Mouton de Gruyter.
Backus, A. (2015). Rethinking Weinreich, Labov & Herzog from a usage-based
perspective: Contact-induced change in Dutch Turkish. Taal en tongval:
Tijdschrift voor taalvariatie, 67(2), 275-306.
https://doi.org/10.5117/TET2015.2.BACK
Backus, A. & Mos, M. (2011). Islands of (im)productivity in corpus data and
acceptability judgments: Contrasting two potentiality constructions in Dutch. In:
D. Schönefeld (Ed.), Converging Evidence (pp. 165-192). Amsterdam: John
Benjamins.
Bader, M. & Häussler, J. (2010). Toward a model of grammaticality judgments.
Journal of Linguistics, 46, 273–330.
Baker, F. B. & Kim, S.-H. (2004). Item response theory: Parameter estimation
techniques. New York: Dekker.
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H. & Yap, M. J.
(2004). Visual word recognition for single syllable words. Journal of
Experimental Psychology General, 133(2), 283-316.
Balota, D. A., Yap, M. J., Cortese, M. J., Hutchison, K. A., Kessler, B., Loftis, B., Neely,
J. H., Nelson, D. L., Simpson, G. B. & Treiman, R. (2007). The English lexicon
project. Behavior Research Methods, 39, 445-459.
Balota, D. A., Pilotti, M. & Cortese, M. J. (2001). Subjective frequency estimates
for 2,938 monosyllabic words. Memory & Cognition, 29(4), 639–647.
Bannard, C. & Matthews, D. (2008). Stored word sequences in language learning:
The effect of familiarity on children’s repetition of four-word combinations.
Psychological Science, 19, 241–248.
Bar, M. (2007). The proactive brain: using analogies and associations to generate
predictions. Trends in Cognitive Sciences, 11, 280–289.
doi:10.1016/j.tics.2007.05.005
Bar, M., Neta, M. & Linz, H. (2006). Very first impressions. Emotion, 6(2), 269–
278.
Bard, E., Robertson, D. & Sorace, A. (1996). Magnitude estimation of linguistic
acceptability. Language, 72, 32-68.
Barking, M., Backus, A. & Mos, M. (submitted). Comparing forward and reverse
transfer from Dutch to German.
Barlow, M. (2013). Individual differences and usage-based grammar. International
Journal of Corpus Linguistics, 18(4), 443–478.
Barlow, M. & Kemmer, S. (2000). Usage-based models of language. Cambridge:
Cambridge University Press.
Barr, D. J., Levy, R., Scheepers, C. & Tily, H. J. (2013). Random effects structure
for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and
Language, 68(3), 255-278.
Barth, D. & Kapatsinski, V. (2014). A multimodel inference approach to categorical
variant choice: construction, priming and frequency effects on the choice
between full and contracted forms of am, are and is. Corpus Linguistics and
Linguistic Theory, 13(2), 203-260. doi:10.1515/cllt-2014-0022.
Bates, D. M., Mächler, M., Bolker, B. M. & Walker, S. C. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1-48.
Birdsong, D. (1989). Metalinguistic performance and interlinguistic competence.
New York: Springer.
Blasko, D. G. & Connine, C. M. (1993). Effects of familiarity and aptness on
metaphor processing. Journal of Experimental Psychology: Learning, Memory,
and Cognition, 19(2), 295-308.
Blom, W. B. T., Paradis, J. & Sorenson Duncan, T. (2012). Effects of input
properties, vocabulary size, and L1 on the development of third person singular
-s in child L2 English. Language Learning, 62(3), 965-994.
Blom, E. & Unsworth, S. (2010). Experimental methods in language acquisition
research. Amsterdam: Benjamins.
Boersma, P. & Weenink, D. (2015). Praat: doing phonetics by computer [Computer
program]. Version 5.4.06, retrieved 21 February 2015 from
http://www.praat.org/
Borensztajn, G., Zuidema, W. & Bod, R. (2009). Children’s grammars grow more
abstract with age -Evidence from an automatic procedure for identifying the
productive units of language. Topics in Cognitive Science, 1, 175-188.
doi:10.1111/j.1756-8765.2008.01009.x
Bornkessel-Schlesewsky, I. & Schlesewsky, M. (2007). The wolf in sheep’s
clothing: Against a new judgment-driven imperialism. Theoretical Linguistics,
33(3), 319-333.
Branigan, H. P. & Pickering, M. J. (2017). An experimental approach to linguistic
representation. Behavioral and Brain Sciences, 40, 1-73.
doi:10.1017/S0140525X16002028
Brothers, T., Swaab, T. Y. & Traxler, M. J. (2015). Effects of prediction and
contextual support on lexical processing: Prediction takes precedence.
Cognition, 136, 135–149.
Brothers, T., Swaab, T. Y. & Traxler, M. J. (2017). Goals and strategies influence
lexical prediction during sentence comprehension. Journal of Memory and
Language, 93, 203-216.
Bryman, A. (2004) Triangulation. In M. Lewis-Beck, A. Bryman & T. F. Liao (Eds.),
The Sage encyclopedia of social science research methods (p. 1143).
doi:10.4135/9781412950589.n1031
Buchstaller, I. & Khattab, G. (2013). Population samples. In R. J. Podesva & D.
Sharma (Eds.), Research methods in linguistics (pp. 74-95). Cambridge:
Cambridge University Press.
Burmester, J., Spalek, K. & Wartenburger, I. (2014). Context updating during
sentence comprehension: The effect of aboutness topic. Brain and Language,
137, 62-76.
Bybee, J. (2002). Phonological evidence for exemplar storage of multiword
sequences. Studies in Second Language Acquisition, 24, 215-221.
Bybee, J. (2006). From usage to grammar: the mind’s response to repetition.
Language, 82, 529–551.
Bybee, J. (2007). Frequency of use and the organization of language. Oxford:
Oxford University Press.
Bybee, J. (2010). Language, usage and cognition. Cambridge: Cambridge
University Press.
Bybee, J. & Scheibman, J. (1999). The effect of usage on degrees of constituency:
The reduction of ‘don’t’ in English. Linguistics, 37, 575–596.
Cacioppo, J. T., Petty, R. E. & Kao, C. F. (1984). The efficient assessment of need
for cognition. Journal of Personality Assessment, 48(3), 306-307.
Caldwell-Harris, C., Berant, J. & Edelman, Sh. (2012). Measuring mental
entrenchment of phrases with perceptual identification, familiarity ratings, and
corpus frequency statistics. In D. Divjak & S. Gries (Eds.), Frequency effects in
language representation (pp. 165-194). Berlin: Mouton de Gruyter.
Camblin, C., Gordon, P. & Swaab, T. (2007). The interplay of discourse congruence
and lexical association during sentence processing: Evidence from ERPs and
eye tracking. Journal of Memory and Language, 56(1), 103-128.
Carlsson, K., Petrovic, P., Skare, S., Petersoon, K. M. & Ingvar, M. (2000). Tickling
expectations: neural processing in anticipation of a sensory stimulus. Journal
of Cognitive Neuroscience, 12(4), 691–703.
Cattell, R. B. (1971). Abilities: Their structure, growth and action. Boston:
Houghton-Mifflin.
Chaudron, C. (1983). Research on metalinguistic judgments: A review of theory,
methods, and results. Language Learning, 33(3), 343–377.
Chen, S. & Goodman, J. (1999). An empirical study of smoothing techniques for
language modeling. Computer Speech and Language, 13, 359–394.
Chen, P. & Popovich, P. (2002). Correlation: Parametric and nonparametric
measures. Thousand Oaks, CA: Sage.
Chomsky, N. (1965). Aspects of the theory of syntax. Cambridge: MIT Press.
Christiansen, M. & Chater, N. (2008). Language as shaped by the brain.
Behavioral and Brain Sciences 31, 489-558.
Christiansen, M. & Chater, N. (2016). The Now-or-Never bottleneck: A fundamental
constraint on language. Behavioral and Brain Sciences, 39, E62.
doi:10.1017/S0140525X1500031X
Church, K. & Gale, W. (1995). Poisson mixtures. Journal of Natural Language
Engineering, 1(2), 163–190.
Churchill, G. & Peter, J. P. (1984). Research design effects on the reliability of
rating scales: A meta-analysis. Journal of Marketing Research, 21(4), 360–375.
Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future
of cognitive science. Behavioral and Brain Sciences, 36(3), 181-204.
doi:10.1017/S0140525X12000477
Clark, H. (1973). The language-as-fixed-effect fallacy: A critique of language
statistics in psychological research. Journal of Verbal Learning and Verbal
Behavior, 12, 335–359.
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45(12),
1304-1312.
Colman, A. M., Norris, C. E. & Preston, C. C. (1997). Comparing rating scales of
different lengths: Equivalence of scores from 5-point and 7-point scales.
Psychological Reports, 80(2), 355–362.
Connine, C. M., Mullennix, J., Shernoff, E. & Yelen, J. (1990). Word familiarity and
frequency in visual and auditory word recognition. Journal of Experimental
Psychology: Learning, Memory & Cognition, 16, 1084-1096.
Crain, S. & Lillo-Martin, D. (1999). An introduction to linguistic theory and language
acquisition. Malden: Blackwell.
Croft, W. (2000). Explaining language change: An evolutionary approach. London:
Longman.
Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests.
Psychometrika, 16(3), 297–334.
Cronk, B. C., Lima, S. D. & Schweigert, W. A. (1993). Idioms in sentences: Effects
of frequency, literalness, and familiarity. Journal of Psycholinguistic Research,
22(1), 59-82.
Cumming, G. (2014). The new statistics: Why and how. Psychological Science,
25(1), 7–29.
Dąbrowska, E. (2008). The effects of frequency and neighbourhood density on
adult speakers' productivity with Polish case inflections: An empirical test of
usage-based approaches to morphology. Journal of Memory and Language, 58,
931-951.
Dąbrowska, E. (2010). Naive v. expert competence: An empirical study of speaker
intuitions. The Linguistic Review, 27, 1–23.
Dąbrowska, E. (2010). The mean lean grammar machine meets the human mind:
Empirical investigations of the mental status of rules. In H.-J. Schmid & S. Handl
(Eds.), Cognitive foundations of linguistic usage patterns. Empirical approaches
(pp. 151–170). Berlin: Mouton de Gruyter.
Dąbrowska, E. (2012). Different speakers, different grammars: Individual
differences in native language attainment. Linguistic Approaches to
Bilingualism, 2(3), 219-253.
Dąbrowska, E. (2013). Functional constraints, usage, and mental grammars: A
study of speakers' intuitions about questions with long-distance dependencies.
Cognitive Linguistics, 24(4), 633–665.
Dąbrowska, E. (2014). Recycling utterances: A speaker's guide to sentence
processing. Cognitive Linguistics, 25(4), 617–653.
Dąbrowska, E. (2016a). Cognitive Linguistics’ seven deadly sins. Cognitive
Linguistics, 27(4), 479–491.
Dąbrowska, E. (2016b). Language in the mind and in the community. In J. Daems,
E. Zenner, K. Heylen, D. Speelman & H. Cuyckens (Eds.), Change of paradigms
– New paradoxes (pp. 221–235). Berlin: Walter de Gruyter.
Dąbrowska, E. (2018). Experience, aptitude and individual differences in native
language ultimate attainment. Cognition, 178, 222-235.
Dąbrowska, E. & Lieven, E. (2005). Towards a lexically specific grammar of
children's question constructions. Cognitive Linguistics, 16(3), 437-474.
Dambacher, M., Kliegl, R., Hofmann, M. & Jacobs, A. M. (2006). Frequency and
predictability effects on event-related potentials during reading. Brain Research,
1084(1), 89–103.
Daneman, M. E. & Carpenter, P. A. (1980). Individual differences in working
memory and reading. Journal of Verbal Learning and Verbal Behavior, 19, 450-466.
De Deyne, S. & Storms, G. (2008). Word associations: Norms for 1,424 Dutch
words in a continuous task. Behavior Research Methods, 40, 198-205.
De Bot, K. & Schrauf, R. (2009). Language development over the lifespan. New
York: Routledge.
DeLong, K., Urbach, T. & Kutas, M. (2005). Probabilistic word pre-activation during
language comprehension inferred from electrical brain activity. Nature
Neuroscience, 8(8), 1117-1121.
Diessel, H. (2007). Frequency effects in language acquisition, language use, and
diachronic change. New Ideas in Psychology, 25(2), 108–127.
Divjak, D. (2016). The role of lexical frequency in the acceptability of syntactic
variation. Evidence from that-clauses in Polish. Cognitive Science, 41(2), 354-382. doi:10.1111/cogs.12335.
Dunn, L. M. & Dunn, L. M. (1997). Peabody Picture Vocabulary Test (3rd ed.). Circle
Pines, MN: American Guidance Service.
Eckert, P. (1997). Age as a sociolinguistic variable. In Florian Coulmas (Ed.),
Handbook of sociolinguistics (pp. 151–167). Oxford: Blackwell.
Eckert, P. (2012). Three waves of variation study: The emergence of meaning in
the study of variation. Annual Review of Anthropology, 41, 87–100.
Eckert, P. (2013). Ethics in linguistic research. In R.J. Podesva & D. Sharma (Eds.),
Research methods in linguistics (pp. 11-26). Cambridge: Cambridge University
Press.
Ellis, N. C. (2002). Frequency effects in language processing: A review with
implications for theories of implicit and explicit language acquisition. Studies in
Second Language Acquisition, 24(2), 143–188.
Ellis, N. C. & Simpson-Vlach, R. (2009). Formulaic language in native speakers:
Triangulating psycholinguistics, corpus linguistics and education. Corpus
Linguistics and Linguistic Theory, 5(1), 61-78.
Ellis, R. (1991). Grammaticality judgments and second language acquisition.
Studies in Second Language Acquisition, 13(2), 161–186.
Ellis, R. (2005). Measuring implicit and explicit knowledge of a second language:
A psychometric study. Studies in Second Language Acquisition, 27, 141–172.
Featherston, S. (2007). Data in generative grammar: The stick and the carrot.
Theoretical Linguistics, 33, 269–318.
Fernandez Monsalve, I., Frank, S.L. & Vigliocco, G. (2012). Lexical surprisal as a
general predictor of reading time. Proceedings of the 13th Conference of the
European Chapter of the Association for Computational Linguistics (pp. 398-408). Avignon, France: Association for Computational Linguistics.
Ferrand, L., Brysbaert, M., Keuleers, E., New, B., Bonin, P., Méot, A., Augustinova,
M. & Pallier, C. (2011). Comparing word processing times in naming, lexical
decision, and progressive demasking: evidence from Chronolex. Frontiers in
Psychology, 2(306), 1-10.
Field, A. (2013). Discovering statistics using IBM SPSS Statistics: and sex and
drugs and rock 'n' roll. Los Angeles, CA: Sage.
Field, A., Miles, J. & Field, Z. (2012). Discovering statistics using R. London:
Thousand Oaks.
Fitzpatrick, T., Playfoot, D., Wray, A. & Wright, M. (2015). Establishing the
Reliability of Word Association Data for Investigating Individual and Group
Differences. Applied Linguistics, 36, 23-50.
Flynn, S. (1986). Production vs. comprehension: Differences in underlying
competences. Studies in Second Language Acquisition, 8, 135–164.
Forster, K. & Chambers, S. (1973). Lexical access and naming time. Journal of
Verbal Learning and Verbal Behavior, 12(6), 627–635.
Foulkes, P. (2006). Phonological variation: A global perspective. In B. Aarts & A.
McMahon (Eds.), The Handbook of English Linguistics (pp. 625–669). Oxford,
UK: Blackwell.
Frank, S. L. (2013). Uncertainty reduction as a measure of cognitive load in
sentence comprehension. Topics in Cognitive Science, 5, 475-494.
Frank, S. L., Otten, L. J., Galli, G. & Vigliocco, G. (2015). The ERP response to the
amount of information conveyed by words in sentences. Brain & Language, 140,
1–11.
Gardner, M. K., Rothkopf, E. Z., Lapan, R. & Lafferty, T. (1987). The word frequency
effect in lexical decision: Finding a frequency-based component. Memory and
Cognition, 15, 24–28.
Garrod, S. & Pickering, M. J. (2004). Why is conversation so easy? Trends in
Cognitive Sciences, 8(1), 8–11.
Gernsbacher, M. A. (1984). Resolving 20 years of inconsistent interactions
between lexical familiarity and orthography, concreteness, and polysemy.
Journal of Experimental Psychology: General, 113, 256–281.
Gibbs, R. Jr. (2006). Introspection and cognitive linguistics: Should we trust our
own intuitions? Annual Review of Cognitive Linguistics, 4, 135-151.
Gibson, E. & Fedorenko, E. (2010). Weak quantitative standards in linguistics
research. Trends in Cognitive Sciences, 14, 233–234.
Gibson, E. & Fedorenko, E. (2013). The need for quantitative methods in syntax
and semantics research. Language and Cognitive Processes, 28(1), 88–124.
Gilquin, G. & Gries, S. Th. (2009). Corpora and experimental methods: A state-of-the-art review. Corpus Linguistics and Linguistic Theory, 5(1), 1–26.
Goldberg, A. E. (2006). Constructions at work. The nature of generalization in
language. Oxford: Oxford University Press.
Goldinger, S. D. (1996). Words and voices: Episodic traces in spoken word
identification and recognition memory. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 22, 1166–1183.
Gómez, R. (2002). Variability and detection of invariant structure. Psychological
Science, 13, 431-436.
Goudbeek, M., Swingley, D. & Smits, R. (2009). Supervised and unsupervised
learning of multidimensional acoustic categories. Journal of Experimental
Psychology: Human Perception and Performance, 35(6), 1913–1933.
Granger, S. (1998). Prefabricated patterns in advanced EFL writing: Collocations
and formulae. In A. P. Cowie (Ed.), Phraseology: Theory, analysis, and
applications (pp. 145–160). Oxford: Oxford University Press.
Gries, S. Th. (2008). Dispersions and adjusted frequencies in corpora.
International Journal of Corpus Linguistics, 13(4), 403-437.
Gries, S. Th. (2010a). Useful statistics for corpus linguistics. In A. Sánchez & M.
Almela (Eds.), A mosaic of corpus linguistics: selected approaches (pp. 269-291). Frankfurt am Main: Peter Lang.
Gries, S. Th. (2010b). Behavioral profiles: a fine-grained and quantitative approach
in corpus-based lexical semantics. The Mental Lexicon, 5(3), 323-346.
Gries, S. Th. (2012). Frequencies, probabilities, association measures in usage/exemplar-based linguistics: some necessary clarifications. Studies in
Language, 36(3), 477-510.
Gries, S. Th. (2014). Quantitative corpus approaches to linguistic analysis: seven
or eight levels of resolution and the lessons they teach us. In I. Taavitsainen, M.
Kytö, C. Claridge & J. Smith (Eds.), Developments in English: expanding
electronic evidence (pp. 29–47). Cambridge: Cambridge University Press.
Gries, S. Th. (2015). Quantitative methods in linguistics. In J. D. Wright (Ed.),
International Encyclopedia of the Social and Behavioral Sciences, 2nd edn., vol.
19 (pp. 725–732). Amsterdam: Elsevier.
Gries, S. Th. & Divjak, D. (2012). Frequency effects in language representation.
Retrieved from https://ebookcentral.proquest.com
Gries, S. Th. & Newman, J. (2013). Creating and using corpora. In R.J. Podesva &
D. Sharma (Eds.), Research methods in linguistics (pp. 257-287). Cambridge:
Cambridge University Press.
Gries, S. Th. & Wulff, S. (2009). Psycholinguistic and corpus linguistic evidence
for L2 constructions. Annual Review of Cognitive Linguistics, 7, 163–186.
Griffin, Z.M. & Bock, K. (1998). Constraint, word frequency, and the relationship
between lexical processing levels in spoken word production. Journal of
Memory and Language, 38, 313-338.
Gupta, P. (2003). Examining the relationship between word learning, nonword
repetition, and immediate serial recall in adults. Quarterly Journal of
Experimental Psychology: Human Experimental Psychology, 56, 1213-1236.
doi:10.1080/02724980343000071
Gutiérrez-Clellen, V. & Kreiter, J. (2003). Understanding child bilingual acquisition
using parent and teacher reports. Applied Psycholinguistics, 24(2), 267-288.
Hammersley, M. (2008). Troubles with triangulation. In M. M. Bergman (Ed.),
Advances in mixed methods research (pp. 22–36). London: SAGE.
Hashemi, M. R. & Babaii, E. (2013). Mixed methods research: Toward new
research designs in applied linguistics. The Modern Language Journal, 97(4),
828–852.
Henrich, J., Heine, S. & Norenzayan, A. (2010). The weirdest people in the world?
Behavioral and Brain Sciences, 33, 61–83. doi:10.1017/S0140525X0999152X
Hintzman, D. L. (1986). "Schema abstraction" in a multiple-trace memory model.
Psychological Review, 93(4), 411–428.
Hintzman, D. L. (2011). Research strategy in the study of memory: Fads, fallacies,
and the search for the “coordinates of truth”. Perspectives on Psychological
Science, 6(3), 253–271.
Howes, D. & Solomon, R. (1951). Visual duration threshold as a function of word-probability. Journal of Experimental Psychology, 41(6), 401-410.
Huettig, F. (2015). Four central questions about prediction in language
processing. Brain Research, 1626, 118–135.
Hulstijn, J. (2010). Measuring second language proficiency. In E. Blom & S.
Unsworth (Eds.), Experimental Methods in Language Acquisition Research (pp.
185-200). Amsterdam: Benjamins.
Isakov, A., Holcomb, A., Glowacki, L. & Christakis, N. A. (2016). Modeling the role
of networks and individual differences in inter-group violence. PloS ONE, 11(2),
e0148314. https://doi.org/10.1371/journal.pone.0148314
Jaeger, T. F. (2008). Categorical data analysis: Away from ANOVAs
(transformation or not) and towards logit mixed models. Journal of Memory
and Language, 59(4), 434–446.
Janda, L. A. (2013). Quantitative methods in cognitive linguistics: An introduction.
In L. A. Janda (Ed.), Cognitive Linguistics: The quantitative turn (pp. 1–32).
Berlin: De Gruyter Mouton.
Janssen, N. & Barber, H.A. (2012). Phrase frequency effects in language
production. PLoS ONE, 7(3): e33202. doi:10.1371/journal.pone.0033202
Jarvis, S. (2013). Capturing the diversity in lexical diversity. Language Learning,
63, 87-106.
Jiang, X. L. & Cillessen, A. (2005). Stability of continuous measures of sociometric
status: A meta-analysis. Developmental Review, 25(1), 1–25.
Johnson, P. C. D. (2014). Extension of Nakagawa & Schielzeth’s R²GLMM to random
slopes models. Methods in Ecology and Evolution, 5, 944–946.
doi:10.1111/2041-210X.12225
Johnson, J. S., Shenkman, K. D., Newport, E. L. & Medin, D. L. (1996).
Indeterminacy in the grammar of adult language learners. Journal of Memory
and Language, 35(3), 335–352.
Jolsvai, H., McCauley, S. M. & Christiansen, M. H. (2013). Meaning overrides
frequency in idiomatic and compositional multiword chunks. Proceedings of the
Annual Meeting of the Cognitive Science Society, 35. Retrieved from
https://escholarship.org/uc/item/5cv7b5xs
Juhasz, B. J., Lai, Y.-H. & Woodcock, M. L. (2015). A database of 629 English
compound words: Ratings of familiarity, lexeme meaning dominance, semantic
transparency, age of acquisition, imageability, and sensory experience. Behavior
Research Methods, 47(4), 1004-1019.
Juhasz, B. J. & Rayner, K. (2003). Investigating the effects of a set of
intercorrelated variables on eye fixation durations in reading. Journal of
Experimental Psychology: Learning, Memory, and Cognition, 29, 1312–1318.
Jurafsky, D., Bell, A., Gregory, M. & Raymond, W. (2001). Probabilistic relations
between words: Evidence from reduction in lexical production. In J. Bybee & P.
Hopper (Eds.), Frequency and the emergence of linguistic structure (pp. 229–
254). Amsterdam: John Benjamins.
Kaiser, E. (2013). Experimental paradigms in psycholinguistics. In R. Podesva &
D. Sharma (Eds.), Research methods in linguistics (pp. 135-168). Cambridge:
Cambridge University Press.
Kamoen, N. (2012). Positive versus negative: A cognitive perspective on wording
effects for contrastive questions in attitude surveys. Utrecht: LOT dissertation.
Keller, F. & Alexopoulou, T. (2001). Phonology competes with syntax:
Experimental evidence for the interaction of word order and accent placement
in the realization of information structure. Cognition, 79, 301–372.
Kemp, N., Mitchell, P. & Bryant, P. (2017). Simple morphological spelling rules are
not always used: Individual differences in children and adults. Applied
Psycholinguistics, 38, 1071-1094. doi:10.1017/S0142716417000042
Kertész, A., Schwarz-Friesel, M. & Consten, M. (2012). Introduction: converging
data sources in cognitive linguistics. Language Sciences, 34(6), 651–655.
Keuleers, E. & Balota, D. A. (2015). Megastudies, crowdsourcing, and large
datasets in psycholinguistics: an overview of recent developments introduction.
Quarterly Journal of Experimental Psychology, 68(8), 1457-1468.
Keuleers, E., Diependaele, K. & Brysbaert, M. (2010). Practice effects in large-scale
visual word recognition studies: A lexical decision study on 14,000 Dutch mono- and disyllabic words and nonwords. Frontiers in Psychology, 1, 174-189.
http://doi.org/10.3389/fpsyg.2010.00174
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus
Linguistics, 6(1), 1-37.
Kirsner, K. (1994). Implicit processes in second language learning. In N. Ellis (Ed.),
Implicit and explicit learning of languages (pp. 283–312). San Diego, CA:
Academic Press.
Kliegl, R., Grabner, E., Rolfs, M. & Engbert, R. (2004). Length, frequency, and
predictability effects of words on eye movements in reading. European Journal
of Cognitive Psychology, 16(1–2), 262–284.
Koole, S. L. & Lakens, D. (2012). Rewarding replications: A sure and simple way
to improve psychological science. Perspectives on Psychological Science, 7,
608–614. doi:10.1177/1745691612462586
Kristiansen, G. & Dirven, R. (2008). Cognitive sociolinguistics: language variation,
cultural models, social systems. Berlin: Mouton de Gruyter.
Kuhl, P. (2000). A new view of language acquisition. Proceedings of the National
Academy of Sciences of the United States of America, 97(22), 11850–11857.
Kuperberg, G. R. & Jaeger, T. F. (2016). What do we mean by prediction in
language comprehension? Language, Cognition and Neuroscience, 31(1), 32-59.
doi:10.1080/23273798.2015.1102299
Kutas, M., DeLong, K. A. & Smith, N. J. (2011). A look around at what lies ahead:
Prediction and predictability in language processing. In M. Bar (Ed.), Predictions
in the Brain: Using Our Past to Generate a Future (pp. 190–207). New York:
Oxford University Press.
Labov, W. (n.d.). Some observations on the foundations of linguistics. Retrieved
from http://www.ling.upenn.edu/~wlabov/Papers/Foundations.html (11 July,
2012).
Labov, W. (1966). The social stratification of English in New York City.
Washington: Center for Applied Linguistics.
Labov, W. (2001). Principles of linguistic change. Vol. 2: Social factors. Oxford:
Blackwell.
Lakoff, G. (1987). Women, fire, and dangerous things: What categories reveal
about the mind. Chicago: Chicago University Press.
Langacker, R. (1987). Foundations of Cognitive Grammar, Vol. I. Stanford:
Stanford University Press.
Langacker, R. (2000). A dynamic usage-based model. In M. Barlow & S. Kemmer
(Eds.), Usage-based models of language (pp. 1-63). Stanford, CA: CSLI
Publications.
Langsford, S., Perfors, A., Hendrickson, A., Kennedy, L. & Navarro, D. (2018).
Quantifying sentence acceptability measures: Reliability, bias, and variability.
Glossa: A Journal of General Linguistics, 3(1), 1–34.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3),
1126-1177. doi:10.1016/j.cognition.2007.05.006
Lieven, E., Salomo, D. & Tomasello, M. (2009). Two-year-old children’s production
of multiword utterances: A usage-based analysis. Cognitive Linguistics, 20(3),
481-507.
Lilly, B. (2009). Optimizing stimuli order in marketing experiments: A comparison
of four orders using six criteria. Journal of Targeting, Measurement and
Analysis for Marketing, 17, 245-255. doi:10.1057/jt.2009.17
Linzen, T. & Jaeger, F. (2016). Uncertainty and expectation in sentence
processing: evidence from subcategorization distributions. Cognitive Science,
40(6), 1382–1411.
Lüdeling, A. & Kytö, M. (2008). Corpus Linguistics: An International Handbook.
Volume 1. Berlin: Mouton de Gruyter.
Luka, B. & Barsalou, L. (2005). Structural facilitation: Mere exposure effects for
grammatical acceptability as evidence for syntactic priming in
comprehension. Journal of Memory and Language, 52(3), 436-459.
Mandera, P., Keuleers, E. & Brysbaert, M. (2017). Explaining human performance
in psycholinguistic tasks with models of semantic similarity based on prediction
and counting: A review and empirical validation. Journal of Memory and
Language, 92, 57–78.
Matuschek, H., Kliegl, R., Vasishth, S., Baayen, R. H. & Bates, D. (2017). Balancing
type I error and power in linear mixed models. Journal of Memory and
Language, 94, 305-315.
Maxwell, S. E., Kelley, K. & Rausch, J. R. (2008). Sample size planning for
statistical power and accuracy in parameter estimation. Annual Review of
Psychology, 59, 537–563.
McCauley, S. & Christiansen, M. (2014). Acquiring formulaic language: A
computational model. The Mental Lexicon, 9, 419-436.
McCauley, S., Isbilen, E. & Christiansen, M. (2017). Chunking ability shapes
sentence processing at multiple levels of abstraction. In G. Gunzelmann, A.
Howes, T. Tenbrink & E. J. Davelaar (Eds.), Proceedings of the 39th Annual
Conference of the Cognitive Science Society (pp. 2681– 2686). Austin, TX:
Cognitive Science Society.
McDonald, S. A. & Shillcock, R. C. (2003). Eye movements reveal the on-line
computation of lexical probabilities during reading. Psychological Science,
14(6), 648-652.
McEvoy, C. L. & Nelson, D. L. (1982). Category name and instance norms for 106
categories of various sizes. American Journal of Psychology, 95, 581-634.
McNamara, T. P. (2005). Semantic priming: Perspectives from memory and word
recognition. Hove, England: Psychology Press.
McRae, K., Cree, G. S., Seidenberg, M. S. & McNorgan, C. (2005). Semantic feature
production norms for a large set of living and nonliving things. Behavior
Research Methods, 37(4), 547-559.
Meng, M. & Bader, M. (2000). Ungrammaticality detection and garden-path
strength: Evidence for serial parsing. Language and Cognitive Processes, 15,
615–666.
Misyak, J. B. & Christiansen, M. H. (2012). Statistical learning and language: an
individual differences study. Language Learning, 62(1), 302–331.
doi:10.1111/j.1467-9922.2010.00626.x
Misyak, J. B., Christiansen, M. H. & Tomblin, J. B. (2010). Sequential expectations:
The role of prediction-based learning in language. Topics in Cognitive Science,
2, 138–153. doi:10.1111/j.1756-8765.2009.01072.x
Mos, M., van den Bosch, A. & Berck, P. (2012). The predictive value of word-level
perplexity in human sentence processing: A case study on fixed
adjective-preposition constructions in Dutch. In S. Th. Gries & D. Divjak (Eds.),
Frequency effects in language learning and processing (pp. 207–239). Berlin: De Gruyter.
Myers, J., Well, A. & Lorch, R. (2010). Research design and statistical analysis.
New York: Routledge.
Nelson, D. L., McEvoy, C. L. & Schreiber, T. A. (1998). The University of South
Florida word association, rhyme, and word fragment norms.
http://www.usf.edu/FreeAssociation/
Nordquist, D. (2009). Investigating elicited data from a usage-based perspective.
Corpus Linguistics and Linguistic Theory, 5(1), 105–130.
Nosek, B. & Lakens, D. (2014). Registered reports: A method to increase the
credibility of published results. Social Psychology, 45(3), 137-141.
Nurius, P. S. & Macy, R. J. (2008). Heterogeneity among violence-exposed women:
Applying person-oriented research methods. Journal of Interpersonal
Violence, 23(3), 389-415.
Olsthoorn, N., Andringa, S. & Hulstijn, J. (2014). Visual and auditory digit-span
performance in native and nonnative speakers. International Journal of
Bilingualism, 18(6), 663-673.
Oostdijk, N., Reynaert, M., Hoste, V. & Schuurman, I. (2013). The construction of
a 500-million-word reference corpus of contemporary written Dutch. In P. Spyns
& J. Odijk (eds.), Essential speech and language technology for Dutch: Theory
and applications of natural language processing (pp. 219–247). Dordrecht:
Springer.
Otten, M. & Van Berkum, J. (2008). Discourse-based anticipation during language
processing: Prediction or priming? Discourse Processes, 45, 464–496.
Paiva, C., Barroso, E., Carneseca, E., de Pádua Souza, C., Thomé dos Santos, F.,
López, R. & Sakamoto Ribeiro Paiva, B. (2014). A critical analysis of test-retest
reliability in instrument validation studies of cancer patients under palliative
care: a systematic review. BMC Medical Research Methodology, 14(1), 8–18.
Paradis, J. (2010). Comparing typically-developing children and children with
specific language impairment. In E. Blom & S. Unsworth (Eds.), Experimental
Methods in Language Acquisition Research (pp. 223-244). Amsterdam:
Benjamins.
Pearl, L. (2010). Using computational modeling in language acquisition research.
In E. Blom & S. Unsworth (Eds.), Experimental Methods in Language Acquisition
Research (pp. 163-184). Amsterdam: Benjamins.
Pickering, M. J. & Ferreira, V. S. (2008). Structural priming: A critical review.
Psychological Bulletin, 134(3), 427–459.
Pierrehumbert, J. B. (2001). Exemplar dynamics: word frequency, lenition and
contrast. In J. Bybee & P. Hopper (Eds), Frequency and the emergence of
linguistic structure (pp. 137–157). Amsterdam: John Benjamins.
Podesva, R. J. & Sharma, D. (Eds.) (2013). Research methods in linguistics. Cambridge:
Cambridge University Press.
Popiel, S. J. & McRae, K. (1988). The figurative and literal senses of idioms; or, all
idioms are not used equally. Journal of Psycholinguistic Research, 17, 475–487.
Preston, C. C. & Colman, A. M. (2000). Optimal number of response categories in
rating scales: Reliability, validity, discriminating power, and respondent
preferences. Acta Psychologica, 104(1), 1–15.
Princeton University (2010). About WordNet. Retrieved from
https://wordnet.princeton.edu/
R Core Team (2015). R: A language and environment for statistical computing.
Vienna: R Foundation for Statistical Computing. Available at: http://www.R-project.org/
R Core Team (2017). R: A language and environment for statistical computing.
Vienna, Austria: R Foundation for Statistical Computing. http://www.R-project.org/
Rastle, K., Harrington, J. & Coltheart, M. (2002). 358,534 nonwords: The ARC
Nonword Database. Quarterly Journal of Experimental Psychology, 55(4), 1339-1362.
Raven, J., Raven, J. C. & Court, J. H. (1998). Raven manual section 4: Advanced
progressive matrices. Oxford, UK: Oxford Psychologists Press.
Rayner, K., Ashby, J., Pollatsek, A. & Reichle, E. D. (2004). The effects of frequency
and predictability on eye fixations in reading: Implications for the E-Z reader
model. Journal of Experimental Psychology: Human Perception and
Performance, 30(4), 720–732.
Rayson, P. & Garside, R. (2000). Comparing corpora using frequency
profiling. Proceedings of the Workshop on Comparing Corpora, held in
conjunction with The 38th Annual Meeting of the Association for Computational
Linguistics, 1-6.
Roark, B., Bachrach, A., Cardenas, C. & Pallier, C. (2009). Deriving lexical and
syntactic expectation-based measures for psycholinguistic modeling via
incremental top-down parsing. Proceedings of the 2009 Conference on
Empirical Methods in Natural Language Processing (Singapore), 324–333.
Roehr, K. (2008). Linguistic and metalinguistic categories in second language
learning. Cognitive Linguistics, 19, 67–106.
Roland, D., Yun, H., Koenig, J.-P. & Mauner, G. (2012). Semantic similarity,
predictability, and models of sentence processing. Cognition, 122, 267–279.
Saffran, J. R. (2003). Statistical language learning: Mechanisms and constraints.
Current Directions in Psychological Science, 12, 110-114.
Sampson, G. R. (2007). Grammar without grammaticality. Corpus Linguistics and
Linguistic Theory, 3(1), 1-32.
Sankoff, G. (2006). Age: Apparent time and real time. In K. Brown (Ed.), The
encyclopedia of language and linguistics, 2nd edn., vol. 1 (pp. 110–116). Oxford:
Elsevier.
Schäfer, R. & Bildhauer, F. (2012). Building large corpora from the web using a
new efficient tool chain. Proceedings of the Eighth International Conference on
Language Resources and Evaluation (LREC’12) (pp. 486–493). Istanbul,
Turkey: European Language Resources Association.
Schilling, N. (2013). Surveys and interviews. In R.J. Podesva & D. Sharma (Eds.),
Research methods in linguistics (pp. 96-115). Cambridge: Cambridge University
Press.
Schmid, H.-J. (2007). Entrenchment, salience and basic levels. In D. Geeraerts &
H. Cuyckens (Eds.), The Oxford handbook of cognitive linguistics (pp. 117-138).
Oxford: Oxford University Press.
Schmid, H.-J. (2010). Does frequency in text instantiate entrenchment in the
cognitive system? In D. Glynn & K. Fischer (Eds.), Quantitative methods in
cognitive semantics: Corpus-driven approaches (pp. 101-133). Berlin: Mouton
de Gruyter.
Schmid, H.-J. (2015). A blueprint of the Entrenchment-and-Conventionalization
Model. Yearbook of the German Cognitive Linguistics Association, 3, 1-27.
Schmid, H.-J. & Küchenhoff, H. (2013). Collostructional analysis and other ways
of measuring lexicogrammatical attraction: Theoretical premises, practical
problems and cognitive underpinnings. Cognitive Linguistics, 24(3), 531-577.
Schmid, H.-J. & Mantlik, A. (2015). Entrenchment in historical corpora?
Reconstructing dead authors’ minds from their usage profiles. Anglia, 133(4),
583–623. doi:10.1515/ang-2015-0056
Schmidt, S. (2009). Shall we really do it again? The powerful concept of replication
is neglected in the social sciences. Review of General Psychology, 13, 90–100.
doi:10.1037/a0015108
Schönefeld, D. (2011). Converging evidence: Methodological and theoretical
issues for linguistic research. Amsterdam: John Benjamins.
Schönefeld, D. (2011). Introduction: On evidence and the convergence of evidence
in linguistic research. In D. Schönefeld (Ed.), Converging evidence.
Methodological and theoretical issues for linguistic research (pp. 1–32).
Amsterdam: John Benjamins.
Schütze, C. T. (1996). The empirical base of linguistics: Grammaticality judgments
and linguistic methodology. Chicago: University of Chicago Press.
Schütze, C. T. & Sprouse, J. (2013). Judgment data. In R.J. Podesva & D. Sharma
(Eds.), Research methods in linguistics (pp. 27-50). Cambridge: Cambridge
University Press.
Sebregts, K. (2015). The sociophonetics and phonology of Dutch r. Utrecht:
Utrecht University dissertation.
Seidenberg, M. S. (1997). Language acquisition and use: learning and applying
probabilistic constraints. Science, 275, 1599–1603.
Seto, E., Hua, J., Wu, L., Shia, V., Eom, S., Wang, M. & Li, Y. (2016). Models of
individual dietary behavior based on smartphone data: the influence of routine,
physical activity, emotion, and food environment. PLoS ONE, 11(4), e0153085.
https://doi.org/10.1371/journal.pone.0153085
Shaoul, C., Baayen, R. H. & Westbury, C. F. (2014). N-gram probability effects
in a cloze task. The Mental Lexicon, 9, 437-472.
Shaoul, C., Westbury, C. F. & Baayen, R. H. (2013). The subjective frequency of
word n-grams. Psihologija, 46(4), 497–537.
Sharma, D. (2011). Style repertoire and social change in British Asian English.
Journal of Sociolinguistics, 15(4), 464–492.
Simmons, W. K., Martin, A. & Barsalou, L. W. (2005) Pictures of appetizing foods
activate gustatory cortices for taste and reward. Cerebral Cortex, 15, 1602–
1608.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University
Press.
Siyanova-Chanturia, A., Conklin, K. & van Heuven, W. (2011). Seeing a phrase “time
and again” matters: The role of phrasal frequency in the processing of multiword
sequences. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 37(3), 776-784.
Smith, N. J. & Levy, R. (2011). Cloze but no cigar: The complex relationship
between cloze, corpus, and subjective probabilities in language processing.
Proceedings of the 33rd Annual Conference of the Cognitive Science Society
(pp. 1637–1642). Austin, TX: Cognitive Science Society.
Smith, N. J. & Levy, R. (2013). The effect of word predictability on reading time is
logarithmic. Cognition, 128(3), 302-319. doi:10.1016/j.cognition.2013.02.013
Smits, R., Sereno, J. & Jongman, A. (2006). Categorization of sounds. Journal of
Experimental Psychology: Human Perception and Performance, 32(3), 733–
754.
Sorace, A. (2000). Gradients in auxiliary selection with intransitive verbs.
Language, 76, 859–890.
Sprouse, J. (2008). Magnitude estimation and the non-linearity of acceptability
judgments. In N. Abner & J. Bishop (Eds.), West Coast Conference on Formal
Linguistics (WCCFL) (pp. 397–403). Somerville, MA: Cascadilla Proceedings
Project.
Sprouse, J. (2011). A test of the cognitive assumptions of magnitude estimation:
Commutativity does not hold for acceptability judgments. Language, 87(2),
274–288.
Sprouse, J. & Almeida, D. (2012). Assessing the reliability of textbook data in
syntax: Adger's Core Syntax. Journal of Linguistics, 48(3), 609–652.
Sprouse, J., Schütze, C. T. & Almeida, D. (2013). A comparison of informal and
formal acceptability judgments using a random sample from Linguistic Inquiry
2001-2010. Lingua, 134, 219-248.
Stanovich, K. E. & West, R. F. (1989). Exposure to print and orthographic
processing. Reading Research Quarterly, 24, 402-433.
Stefanowitsch, A. & Gries, S.Th. (2003). Collostructions: Investigating the
interaction of words and constructions. International Journal of Corpus
Linguistics, 8, 209-243.
Stolcke, A. (2002). SRILM – an extensible language modeling toolkit. Proceedings
of the International Conference on Spoken Language Processing (pp. 901–
904). Denver, Colorado.
Street, J. & Dąbrowska, E. (2010). More individual differences in language
attainment: How much do adult native speakers of English know about passives
and quantifiers? Lingua, 120(8), 2080-2094.
Street, J. & Dąbrowska, E. (2014). Lexically specific knowledge and individual
differences in adult native speakers’ processing of the English passive. Applied
Psycholinguistics, 35(1), 97-118.
Stubbs, M. (1993). British traditions in text analysis: From Firth to Sinclair. In M.
Baker, G. Francis & E. Tognini-Bonelli (Eds.), Text and Technology: In honour of
John Sinclair (pp. 1-33). Amsterdam/Philadelphia: John Benjamins Publishing
Company.
Schwanenflugel, P. J. & Gaviska, D. C. (2005). Psycholinguistic aspects of word
meaning. In D. A. Cruse, F. Hundsnurscher, M. Job & P. R. Lutzeier (Eds.),
Lexikologie: Ein internationales Handbuch zur Natur und Struktur von Wörtern
und Wordschätzen [Lexicology: An international handbook on the nature and
structure of words and vocabularies] (pp. 1735-1748). Berlin: Mouton de
Gruyter.
Tabatabaei, O. & Dehghani, M. (2012). Assessing the reliability of grammaticality
judgment tests. Procedia – Social and Behavioral Sciences, 31, 173–182.
Tabossi, P., Fanari, R. & Wolf, K. (2009). Why are idioms recognized fast? Memory
& Cognition, 37(4), 529–540.
Tanner, D., Inoue, K. & Osterhout, L. (2014). Brain-based individual differences in
on-line L2 grammatical comprehension. Bilingualism, Language and Cognition,
17(2), 277-293.
Taylor, J. R. (2012). The Mental Corpus. How language is represented in the mind.
Oxford: Oxford University Press.
Taylor, W. L. (1953). ‘Cloze' procedure: A new tool for measuring readability.
Journalism Quarterly, 30, 415-433.
Theakston, A. L. (2004). The role of entrenchment in children’s and adults’
performance on grammaticality judgement tasks. Cognitive Development,
19(1), 15–34.
Tily, H., Gahl, S., Arnon, I., Kothari, A., Snider, N. & Bresnan, J. (2009). Syntactic
probabilities affect pronunciation variation in spontaneous speech. Language &
Cognition, 1, 147-165.
Tomasello, M. (2003). Constructing a language: A usage-based theory of
language acquisition. Cambridge, MA: Harvard University Press.
Traxler, M. J. & Foss, D. J. (2000). Effects of sentence constraint on priming in
natural language comprehension. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 26(5), 1266-1282. doi:10.1037/0278-7393.26.5.1266
Treadwell, C. (2017). Introducing communication research. Paths of inquiry (3rd
ed.). Thousand Oaks, CA: Sage Publications.
Tremblay, A. & Baayen, R. H. (2010). Holistic processing of regular four-word
sequences: A behavioral and ERP study of the effects of structure, frequency,
and probability on immediate free recall. In D. Wood (Ed.), Perspectives on
formulaic language; Acquisition and communication (pp. 151–173). London:
The Continuum International Publishing Group.
Tremblay, A. & Tucker, B. V. (2011). The effects of N-gram probabilistic measures
on the recognition and production of four-word sequences. The Mental Lexicon,
6(2), 302–324.
Tryk, H. (1968). Subjective scaling of word frequency. American Journal of
Psychology, 81(2), 170-177.
Tsevis, C. (2008). Hope over fear [Mosaic illustration]. Retrieved from
http://www.dripbook.com/tsevis/illustration-portfolio/barack-obamai/#288337
(6 February, 2014).
University of Twente, Human Media Interaction (n.d.). Twente News Corpus
(TwNC): A multifaceted Dutch news corpus. Retrieved from
http://hmi.ewi.utwente.nl/TwNC/description
Unsworth, S. & Blom, E. (2010). Comparing L1 children, L2 children and L2 adults.
In E. Blom & S. Unsworth (Eds.), Experimental Methods in Language Acquisition
Research (pp. 201-222). Amsterdam: Benjamins.
Van Berkum, J. J. A., Brown, C. M., Zwitserlood, P., Kooijman, V. & Hagoort, P.
(2005). Anticipating upcoming words in discourse: Evidence from ERPs and
reading times. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 31, 443–467.
Van den Bemd, E., Mos, M., Alishahi, A. & Shayan, S. (2014). Does sentence
structure boost early word learning? An artificial language learning study. Wiener
Linguistische Gazette, 78(A), 103-119.
VanGeest, J., Wynia, M., Cummins, D. & Wilson, I. (2002). Measuring deception:
test-retest reliability of physicians’ self-reported manipulation of reimbursement
rules for patients. Medical Care Research and Review, 59(2), 184–196.
Verhagen, A. (2003). Hoe het Nederlands zich een eigen weg baant: Vergelijkende
en historische observaties vanuit een constructie-perspectief. Nederlandse
Taalkunde, 8, 328-346.
Verhagen, A. (2005). Constructiegrammatica en 'usage based' taalkunde.
Nederlandse Taalkunde, 10, 197-222.
Verhagen, V. & Backus, A. (2011). Individual differences in the perception of
entrenchment of multiword units: Evidence from a Magnitude Estimation task.
Toegepaste Taalwetenschap in Artikelen [Applied Linguistics in Article Form],
84/85, 155–165.
Vindras, P., Desmurget, M. & Baraduc, P. (2012). When one size does not fit all: a
simple statistical method to deal with across-individual variations of
effects. PLoS ONE, 7(6), e39059.
https://doi.org/10.1371/journal.pone.0039059
von Eye, A. & Bogat, G. A. (2006). Person-oriented and variable-oriented research:
Concepts, results, and development. Merrill-Palmer Quarterly, 52(3), 390-420.
von Eye, A., Bogat, G. A. & Rhodes, J. E. (2006). Variable-oriented and
person-oriented perspectives of analysis: The example of alcohol consumption in
adolescence. Journal of Adolescence, 29(6), 981-1004.
von Eye, A., Mun, E. Y. & Indurkhya, A. (2004). Typifying developmental
trajectories: A decision-making perspective. Psychology Science, 46, 65–98.
Wasow, T. & Arnold, J. (2005). Intuitions in linguistic argumentation. Lingua, 115,
1481–1496.
Waters, G. & Caplan, D. (1996). The measurement of verbal working memory
capacity and its relation to reading comprehension. Quarterly Journal of
Experimental Psychology, 49, 51-79.
Wechsler, D. (1981). The Wechsler Adult Intelligence Scale-Revised. New York:
Psychological Corporation.
Wells, J. B., Christiansen, M. H., Race, D. S., Acheson, D. J. & MacDonald, M. C.
(2009). Experience and sentence processing: Statistical learning and relative
clause comprehension. Cognitive Psychology, 58, 250–271.
doi:10.1016/j.cogpsych.2008.08.002
Weng, L. (2004). Impact of the number of response categories and anchor labels
on coefficient alpha and test-retest reliability. Educational and Psychological
Measurement, 64(6), 956–972.
Weskott, T. & Fanselow, G. (2011). On the informativity of different measures of
linguistic acceptability. Language, 87(2), 249–273.
Westfall, J., Kenny, D. & Judd, C. (2014). Statistical power and optimal design in
experiments in which samples of participants respond to samples of stimuli.
Journal of Experimental Psychology: General, 143(5), 2020-2045.
Wiechmann, D. (2008). On the computation of collostruction strength. Corpus
Linguistics and Linguistic Theory, 4(2), 253-290.
Willems, R. M., Frank, S. L., Nijhof, A. D., Hagoort, P. & Bosch, A. van den (2016).
Prediction during natural language comprehension. Cerebral Cortex, 26,
2506-2516. doi:10.1093/cercor/bhv075
Williams, R. & Morris, R. (2004). Eye movements, word familiarity, and vocabulary
acquisition. European Journal of Cognitive Psychology, 16, 312–339.
Wray, A. (2002). Formulaic language and the lexicon. Cambridge, England:
Cambridge University Press.
Wulff, S. (2009). Converging evidence from corpus and experimental data to
capture idiomaticity. Corpus Linguistics and Linguistic Theory, 5(1), 131–159.
Zachary, R. (1994). Shipley Institute of Living Scale, revised manual. Los Angeles:
Weston Psychological Services.
Zimmerer, V., Cowell, P. & Varley, R. (2011). Individual behavior in learning of an
artificial grammar. Memory and Cognition, 39, 491–501. doi:10.3758/s13421-010-0039-y
Zipf, G. K. (1935). The psychobiology of language: An introduction to dynamic
philology. Boston: Houghton Mifflin Company.
Appendices
Appendix 2.1
Stimuli

Each item lists the prepositional phrase as it was presented in isolation (– context), with an English gloss, followed by the sentence in which it was presented (+ context), with an English translation.

1. naar huis 'home'
   Hij is gisteren vroeg naar huis gegaan. 'He went home early yesterday.'
2. op school 'at school'
   Met die jongen heb ik vroeger op school gezeten. 'I went to school with that boy.'
3. op vakantie 'on vacation'
   De buren zijn vorige week op vakantie gegaan. 'The neighbors went on vacation last week.'
4. in de klas 'in the classroom'
   De jongen zit in de klas naast zijn beste vriend. 'In the classroom the boy sits next to his best friend.'
5. in de tuin 'in the garden'
   De man is in de tuin aan het werk. 'The man is working in the garden.'
6. in de keuken 'in the kitchen'
   Zijn moeder was in de keuken bezig met het avondeten. 'His mother was busy with the evening meal in the kitchen.'
7. in de auto 'in the car'
   We hebben de boodschappen in de auto gelegd. 'We put the shopping in the car.'
8. in bed 'in bed'
   Zij is nog niet in bed gaan liggen. 'She has not yet lain down in bed.'
9. in de kamer 'in the room'
   Hij heeft het nieuwe schilderij in de kamer opgehangen. 'He has hung the new painting in the room.'
10. aan tafel 'at table'
    De jongen zit aan tafel zijn ontbijt te eten. 'The boy is seated at the table eating his breakfast.'
11. op de bank 'on the couch'
    De jongens liggen op de bank televisie te kijken. 'The boys are lying on the couch watching tv.'
12. in slaap 'asleep'
    Ik kon vannacht niet in slaap vallen. 'I couldn't fall asleep last night.'
13. in het water 'in the water'
    De kinderen zijn in het water aan het spelen. 'The children are playing in the water.'
14. in de lucht 'in the air'
    De maatregelen hebben al langer in de lucht gehangen. 'The measures have been up in the air for a while.'
15. in de hand 'in the hand'
    Ze hebben zelf in de hand wat er gaat gebeuren. 'It's in their own hands what will happen.'
16. in de winkel 'in the shop'
    Zijn nieuwe CD is in de winkel te koop. 'His new CD is for sale in the shops.'
17. in de kerk 'in the church'
    Mijn ouders zijn in de kerk getrouwd. 'My parents got married in church.'
18. in de bus 'in the bus'
    Het meisje zit in de bus met haar moeder. 'The girl is sitting in the bus with her mother.'
19. aan de beurt 'be next'
    Hij wacht totdat hij aan de beurt is. 'He waits until it is his turn.'
20. op de televisie 'on the television / on tv'
    Ze keken een film die op de televisie werd uitgezonden. 'They watched a movie that was broadcast on tv.'
21. op de foto 'in the picture'
    De fotograaf zorgde ervoor dat iedereen op de foto stond. 'The photographer made sure everybody was in the picture.'
22. naar de wc 'to the loo'
    De helft van de klas ging naar de wc in de pauze. 'Half the class went to the loo in the interval.'
23. in het bos 'in the forest'
    Er wonen in het bos veel dieren. 'There are a lot of animals living in the forest.'
24. op de hoek 'at the corner'
    De winkel bevindt zich op de hoek van de straat. 'The shop is located at the corner of the street.'
25. in de kast 'in the cupboard'
    Ze heeft de spulletjes in de kast gelegd. 'She put the things in the cupboard.'
26. in de oven 'in the oven'
    Ze staat op het punt de appeltaart in de oven te zetten. 'She's about to put the apple pie in the oven.'
27. in bad 'in (the) bath'
    Het kindje werd door zijn moeder in bad gezet. 'The little child was put in (the) bath by his mother.'
28. op de deur 'on the door'
    Er wordt op de deur geklopt. 'There's a knock at the door.'
29. achter de computer 'behind the computer'
    De jongens zitten veel achter de computer volgens hun moeder. 'The boys spend a lot of time at the computer, according to their mother.'
30. in de film 'in the film'
    De actrice gaat de hoofdrol in de film vertolken. 'The actress will play the leading part in the film.'
31. in het licht 'in the light'
    Zij waarschuwde hem niet recht in het licht te kijken. 'She warned him not to look straight into the light.'
32. in de pan 'in the pan'
    Je moet de groenten in de pan doen en even laten koken. 'You have to put the vegetables in the pan and let them boil for a while.'
33. op de muur 'on the wall'
    Het filmpje werd op de muur geprojecteerd. 'The film was projected on the wall.'
34. in de kring 'in the ring'
    De kinderen praten in de kring over het weekend. 'The children talked about the weekend in the ring.'
35. van het dak 'off the roof / of the roof'
    De tuinman bood aan de bladeren van het dak te vegen. 'The gardener offered to sweep the leaves off the roof.'
36. in het bed 'in the bed'
    De jongen lag nog in het bed toen zijn moeder binnenkwam. 'The boy was still lying in the bed when his mother came in.'
37. tegen mama 'to mom'
    Hij zei tegen mama dat hij niet ging. 'He told mom he didn't go.'
38. in de buik 'in the stomach'
    Hij vertelde de dokter dat hij pijn in de buik heeft. 'He told the doctor he has pain in the stomach.'
39. met de hond 'with the dog'
    Mijn ouders gaan met de hond wandelen. 'My parents are going to take the dog for a walk.'
40. in de bak 'in the bin / in jail'
    Criminelen horen in de bak te zitten. 'Criminals should be in jail.'
41. in het paleis 'in the palace'
    De bruiloft werd in het paleis gevierd. 'The wedding was celebrated in the palace.'
42. in het hoofd 'in the head'
    Hij had een plan in het hoofd toen hij vertrok. 'He had a plan in mind when he left.'
43. in het bad 'in the bath'
    Het water in het bad bubbelt. 'The water in the bath is bubbling.'
44. in zijn eentje 'on his own'
    Hij gaat in zijn eentje zitten. 'He is going to sit by himself.'
Appendix 2.2

Raw frequencies and base-10 logarithms of the frequency of occurrence per million words in the Corpus of Spoken Dutch (CGN) for the noun (lemma search: FreqN, LogFreqN) and the specific phrase as a whole (FreqPP, LogFreqPP), together with mean familiarity ratings and standard deviations, M (SD), in the –context and +context conditions, both at Time 1 (T1) and Time 2 (T2).

Nr  Phrase               FreqN  LogFreqN  FreqPP  LogFreqPP  T1 –context   T1 +context   T2 –context   T2 +context
 1  naar huis             4730      2.70    1066       2.05   0.75 (.54)    0.71 (.64)    0.74 (.64)    0.69 (.67)
 2  op school             3572      2.58     742       1.89   0.65 (.72)    0.68 (.66)    0.68 (.67)    0.68 (.52)
 3  op vakantie           1715      2.26     480       1.70   0.85 (.51)    0.73 (.62)    0.92 (.56)    0.80 (.58)
 4  in de klas            1484      2.19     341       1.56   0.33 (.61)    0.10 (.69)    0.39 (.64)    0.22 (.64)
 5  in de tuin             873      1.96     251       1.42   0.29 (.68)    0.31 (.63)    0.30 (.69)    0.26 (.64)
 6  in de keuken           727      1.88     223       1.37   0.40 (.69)    0.53 (.50)    0.36 (.66)    0.51 (.54)
 7  in de auto            2811      2.47     207       1.34   0.40 (.64)    0.54 (.58)    0.41 (.74)    0.29 (.62)
 8  in bed                1290      2.13     194       1.31   0.69 (.70)    0.28 (.88)    0.67 (.83)    0.56 (.65)
 9  in de kamer           1941      2.31     190       1.30   0.27 (.68)    0.12 (.78)    0.17 (.85)    0.01 (.73)
10  aan tafel             1233      2.11     187       1.30   0.58 (.61)    0.48 (.68)    0.53 (.62)    0.35 (.66)
11  op de bank             877      1.97     172       1.26   0.61 (.49)    0.53 (.68)    0.54 (.70)    0.57 (.57)
12  in slaap               341      1.56     167       1.25  -0.18 (1.08)   0.28 (.90)    0.24 (.98)    0.46 (.76)
13  in het water          1959      2.31     148       1.20   0.18 (.70)    0.10 (.75)    0.11 (.77)    0.09 (.81)
14  in de lucht            636      1.83     147       1.19  -0.05 (.78)   -0.92 (1.07)  -0.25 (.89)   -0.34 (.91)
15  in de hand            3062      2.51     141       1.17  -0.59 (.97)   -0.04 (.88)   -0.54 (1.00)  -0.19 (.94)
16  in de winkel           838      1.95     136       1.16   0.45 (.55)    0.30 (.70)    0.31 (.66)    0.25 (.79)
17  in de kerk             961      2.01     119       1.10  -0.35 (.93)    0.04 (.88)   -0.17 (.96)   -0.05 (.87)
18  in de bus              995      2.02     118       1.10   0.30 (.68)    0.46 (.63)    0.30 (.74)    0.31 (.62)
19  aan de beurt           294      1.49     108       1.06   0.16 (.69)    0.38 (.55)    0.18 (.73)    0.36 (1.18)
20  op de televisie        617      1.81      69       0.87  -0.27 (1.09)  -0.18 (1.09)  -0.20 (1.10)  -0.42 (1.29)
21  op de foto            1439      2.18      69       0.87   0.43 (.63)    0.46 (.63)    0.39 (.69)    0.44 (.67)
22  naar de wc             264      1.45      67       0.85   0.58 (.60)    0.56 (.62)    0.60 (.65)    0.38 (.70)
23  in het bos             493      1.72      66       0.85   0.17 (.74)    0.19 (.73)    0.13 (.74)    0.00 (.91)
24  op de hoek             606      1.81      65       0.84  -0.21 (.85)    0.13 (.77)   -0.27 (.89)    0.12 (.74)
25  in de kast             600      1.80      62       0.82   0.10 (.72)    0.35 (.66)    0.06 (.78)    0.24 (.59)
26  in de oven             223      1.37      60       0.81   0.23 (1.02)   0.36 (.55)    0.25 (.70)    0.42 (.62)
27  in bad                 181      1.28      54       0.76   0.29 (.75)    0.06 (.81)    0.41 (.72)    0.15 (.72)
28  op de deur            1495      2.20      51       0.74  -0.64 (.97)    0.20 (.80)   -0.39 (.96)    0.02 (.82)
29  achter de computer    1099      2.06      41       0.65   0.47 (.73)    0.34 (.73)    0.50 (.68)    0.39 (.74)
30  in de film            1658      2.24      33       0.55  -0.50 (1.06)   0.27 (.70)   -0.47 (.99)    0.17 (.80)
31  in het licht          1340      2.15      32       0.54  -0.53 (.85)   -0.18 (.74)   -0.46 (.86)   -0.24 (.78)
32  in de pan              214      1.35      29       0.50   0.05 (.74)    0.45 (.53)    0.15 (.66)    0.23 (.65)
33  op de muur             782      1.92      28       0.48  -0.50 (.99)    0.08 (.67)   -0.35 (.83)   -0.02 (.64)
34  in de kring            228      1.38      27       0.47  -0.61 (.88)   -0.30 (.98)   -0.46 (.87)   -0.41 (1.17)
35  van het dak            423      1.65      24       0.42  -1.05 (.97)   -0.33 (.87)   -0.68 (.96)   -0.24 (.80)
36  in het bed            1290      2.13      23       0.40  -0.51 (1.24)  -1.16 (1.29)  -0.51 (1.23)  -1.11 (1.43)
37  tegen mama            1188      2.10      22       0.38  -0.61 (1.10)  -0.24 (1.13)  -0.48 (1.18)  -0.37 (1.13)
38  in de buik             286      1.48      18       0.30  -1.24 (.96)   -1.65 (1.15)  -1.37 (1.03)  -1.50 (1.13)
39  met de hond            789      1.92      15       0.23  -0.19 (1.06)   0.27 (1.23)  -0.30 (.93)    0.06 (.89)
40  in de bak              331      1.54      15       0.23  -0.91 (1.02)  -0.38 (.91)   -0.82 (1.06)  -0.31 (.99)
41  in het paleis          160      1.23      13       0.17  -0.71 (1.02)  -0.37 (.91)   -0.77 (1.01)  -0.42 (1.07)
42  in het hoofd          1700      2.25      13       0.17  -1.27 (1.09)  -1.57 (1.03)  -1.14 (1.17)  -1.67 (1.17)
43  in het bad             181      1.28      12       0.14  -0.64 (1.11)  -0.40 (.91)   -0.64 (1.00)  -0.44 (.99)
44  in zijn eentje          29      0.50       9       0.02  -0.28 (.95)    0.01 (.88)   -0.27 (.90)   -0.11 (.96)
172
Appendix 3.1
Stimuli in the order of presentation

| 1 | naar huis | home |
| 2 | uit de kast | from the cupboard; out of the closet |
| 3 | bij de fietsen | near the bicycles |
| 4 | op papier | on paper |
| 5 | in de groente | in the vegetables |
| 6 | onder de wol | underneath the wool; turn in |
| 7 | op het boek | on the book; on top of the book |
| 8 | onder de mat | underneath the mat |
| 9 | onder het asfalt | underneath the asphalt |
| 10 | in de shampoo | in the shampoo |
| 11 | in het geld | in the money (zwemmen in het geld ‘have pots of money’) |
| 12 | langs de auto | past the car |
| 13 | in het algemeen | in general |
| 14 | op vakantie | on vacation |
| 15 | in de winkel | in the shop |
| 16 | in het bos | in the forest |
| 17 | op de bon | on the ticket (also: be booked; rationed) |
| 18 | naast het hek | beside the fence |
| 19 | voor de schommel | in front of the swing |
| 20 | langs de boeken | along the books |
| 21 | in de lucht | in the air |
| 22 | tot morgen | till tomorrow |
| 23 | in de klas | in the classroom |
| 24 | in de pan | in the pan |
| 25 | in de kamer | in the room |
| 26 | uit de kom | from the bowl; out of its socket |
| 27 | in de oven | in the oven |
| 28 | in de bak | in the bin; in jail |
| 29 | in de piano | in the piano |
| 30 | naast de bloemen | beside the flowers |
| 31 | voor de juf | for the teacher/Miss |
| 32 | naast het café | beside the cafe |
| 33 | tegen de vlakte | against the plain (tegen de vlakte gaan ‘be knocked down’) |
| 34 | uit de gang | from the corridor |
| 35 | naar de boom | towards the tree |
| 36 | op de pof | on tick |
| 37 | tegen de grond | against the ground; to the ground |
| 38 | onder de dekens | underneath the blankets |
| 39 | over de kop | over the head (over de kop gaan ‘overturn’ and ‘go broke’; zich over de kop werken ‘work oneself to death’) |
| 40 | rond de middag | around midday |
| 41 | onder elkaar | amongst themselves; by ourselves; one below the other |
| 42 | van het dak | off the roof; of the roof |
| 43 | aan tafel | at table |
| 44 | naar de wc | to the loo |
| 45 | langs het park | along the park |
| 46 | met gemak | with ease |
| 47 | op televisie | on the television; on tv |
| 48 | naast de auto | beside the car |
| 49 | in het donker | in the dark |
| 50 | om de tekeningen | for the drawings; around the drawings |
| 51 | in de tuin | in the garden |
| 52 | in de oren | in the ears (iets in de oren knopen ‘get something into one’s head’; gaatjes in de oren hebben ‘have pierced ears’) |
| 53 | langs het water | along the water |
| 54 | in bad | in (the) bath |
| 55 | in de koffie | in the coffee |
| 56 | tegen mama | to mom; against mom |
| 57 | over de streep | across the line (iemand over de streep trekken ‘win someone over’) |
| 58 | in het paleis | in the palace |
| 59 | uit de kunst | out of the art; amazing |
| 60 | in de bus | in the bus |
| 61 | op de bank | on the couch |
| 62 | op de hoek | at the corner |
| 63 | met het doel | with the goal (met het doel om ‘with a view to’) |
| 64 | over het gras | across the grass; about the grass |
| 65 | over het karton | over the cardboard; about the cardboard |
| 66 | in de keuken | in the kitchen |
| 67 | met de schoen | with the shoe |
| 68 | op de film | on (the) film |
| 69 | op de meester | on the teacher/master; at the teacher/master |
| 70 | in de kast | in the cupboard |
| 71 | aan de beurt | be next |
| 72 | langs de tafel | along the table |
| 73 | uit het niets | out of nothingness |
| 74 | in de auto | in the car |
| 75 | in de rondte | in a circle |
| 76 | in de foto | in the picture |
| 77 | op school | at school |
| 78 | rond de ingang | around the entrance |
| 79 | uit de trommel | from the tin box; out of the tin box |
Appendix 3.2 Raw frequency and base-10 logarithms of the frequency of occurrence per million words in the subset of the corpus SoNaR*
for the noun (lemma search) and the specific phrase as a whole; mean familiarity ratings and standard deviations both at Time 1 and Time 2.
* This subset consists of texts originating from the Netherlands (143.8 million words) and texts originating either from the Netherlands or
Belgium (51.8 million words).
| Order | Phrase | FreqN | LogFreqN | FreqPP | LogFreqPP | Time 1 Likert M (SD) | Time 1 ME M (SD) | Time 2 Likert M (SD) | Time 2 ME M (SD) |
| 1 | naar huis | 84918 | 2.64 | 14688 | 1.88 | 0.94 (0.41) | 1.17 (0.50) | 1.02 (0.50) | 1.36 (0.51) |
| 77 | op school | 58222 | 2.47 | 8543 | 1.64 | 0.81 (0.34) | 1.15 (0.79) | 0.95 (0.37) | 1.00 (0.61) |
| 13 | in het algemeen | 37893 | 2.29 | 5778 | 1.47 | 0.54 (0.74) | 0.97 (0.42) | 0.65 (0.65) | 0.87 (0.64) |
| 21 | in de lucht | 17713 | 1.96 | 4485 | 1.36 | 0.61 (0.38) | 0.47 (0.48) | 0.65 (0.43) | 0.63 (0.43) |
| 61 | op de bank | 28615 | 2.17 | 4221 | 1.33 | 0.86 (0.36) | 0.93 (0.56) | 0.89 (0.66) | 1.06 (0.57) |
| 14 | op vakantie | 15864 | 1.91 | 3742 | 1.28 | 0.86 (0.39) | 1.14 (0.43) | 0.88 (0.46) | 1.07 (0.45) |
| 74 | in de auto | 37927 | 2.29 | 3532 | 1.26 | 0.84 (0.36) | 0.90 (0.44) | 0.77 (0.53) | 0.88 (0.62) |
| 25 | in de kamer | 44194 | 2.35 | 3259 | 1.22 | 0.59 (0.46) | 0.94 (0.49) | 0.77 (0.41) | 0.74 (0.62) |
| 51 | in de tuin | 10213 | 1.72 | 2860 | 1.17 | 0.65 (0.58) | 0.60 (0.60) | 0.64 (0.57) | 0.83 (0.55) |
| 4 | op papier | 11249 | 1.76 | 2606 | 1.12 | 0.80 (0.30) | 0.82 (0.53) | 0.75 (0.42) | 0.75 (0.61) |
| 43 | aan tafel | 2827 | 1.16 | 2439 | 1.10 | 0.76 (0.32) | 0.89 (0.44) | 0.79 (0.58) | 0.98 (0.51) |
| 66 | in de keuken | 7584 | 1.59 | 2174 | 1.05 | 0.72 (0.47) | 0.92 (0.40) | 0.68 (0.58) | 0.92 (0.48) |
| 47 | op televisie | 12003 | 1.79 | 1955 | 1.00 | 0.82 (0.49) | 1.02 (0.60) | 0.83 (0.55) | 1.18 (0.47) |
| 23 | in de klas | 7181 | 1.56 | 1924 | 0.99 | 0.69 (0.43) | 0.93 (0.57) | 0.69 (0.63) | 0.80 (0.65) |
| 22 | tot morgen | 46260 | 2.37 | 1820 | 0.97 | 0.91 (0.36) | 1.36 (0.55) | 1.08 (0.50) | 1.32 (0.44) |
| 71 | aan de beurt | 7759 | 1.60 | 1743 | 0.95 | 0.71 (0.42) | 0.55 (0.64) | 0.70 (0.49) | 0.77 (0.51) |
| 15 | in de winkel | 12870 | 1.82 | 1611 | 0.92 | 0.74 (0.50) | 1.05 (0.45) | 0.82 (0.52) | 0.82 (0.61) |
| 60 | in de bus | 12053 | 1.79 | 1533 | 0.89 | 0.71 (0.33) | 0.68 (0.59) | 0.66 (0.48) | 0.60 (0.73) |
| 49 | in het donker | 13022 | 1.82 | 1521 | 0.89 | 0.78 (0.27) | 0.76 (0.43) | 0.75 (0.45) | 0.90 (0.37) |
| 16 | in het bos | 16681 | 1.93 | 1295 | 0.82 | 0.53 (0.49) | 0.83 (0.55) | 0.57 (0.59) | 0.61 (0.58) |
| 54 | in bad | 6416 | 1.52 | 1275 | 0.81 | 0.71 (0.39) | 0.67 (0.71) | 0.70 (0.73) | 0.61 (0.69) |
| 2 | uit de kast | 6118 | 1.50 | 1048 | 0.73 | 0.07 (1.00) | 0.45 (0.60) | 0.29 (0.75) | 0.42 (0.60) |
| 70 | in de kast | 6118 | 1.50 | 1010 | 0.71 | 0.55 (0.39) | 0.39 (0.67) | 0.41 (0.62) | 0.34 (0.60) |
| 44 | naar de wc | 17185 | 1.94 | 804 | 0.61 | 0.91 (0.36) | 1.04 (0.51) | 0.93 (0.52) | 1.23 (0.48) |
| 62 | op de hoek | 11205 | 1.76 | 756 | 0.59 | 0.19 (0.76) | 0.14 (0.77) | 0.05 (0.99) | 0.18 (0.66) |
| 41 | onder elkaar | 89055 | 2.66 | 688 | 0.55 | 0.47 (0.61) | 0.45 (0.45) | 0.36 (0.64) | 0.49 (0.56) |
| 52 | in de oren | 10856 | 1.74 | 667 | 0.53 | -0.36 (0.91) | -0.44 (0.79) | -0.22 (0.89) | -0.54 (0.81) |
| 27 | in de oven | 2273 | 1.07 | 651 | 0.52 | 0.60 (0.39) | 0.66 (0.59) | 0.58 (0.47) | 0.57 (0.58) |
| 24 | in de pan | 4233 | 1.34 | 585 | 0.48 | 0.43 (0.45) | 0.64 (0.62) | 0.43 (0.58) | 0.46 (0.67) |
| 63 | met het doel | 26189 | 2.13 | 558 | 0.46 | 0.24 (0.75) | 0.08 (0.94) | 0.22 (0.54) | 0.23 (0.69) |
| 46 | met gemak | 3490 | 1.25 | 528 | 0.43 | 0.58 (0.37) | 0.38 (0.63) | 0.42 (0.66) | 0.56 (0.61) |
| 73 | uit het niets | 89997 | 2.66 | 490 | 0.40 | 0.72 (0.42) | 0.49 (0.73) | 0.53 (0.63) | 0.57 (0.54) |
| 57 | over de streep | 6570 | 1.53 | 483 | 0.39 | 0.33 (0.56) | 0.08 (0.67) | 0.33 (0.58) | 0.14 (0.66) |
| 58 | in het paleis | 5394 | 1.44 | 427 | 0.34 | -0.15 (0.96) | -0.51 (0.74) | -0.39 (0.93) | -0.40 (0.69) |
| 37 | tegen de grond | 33283 | 2.23 | 369 | 0.28 | -0.28 (0.84) | -0.53 (0.66) | -0.24 (0.72) | -0.69 (0.72) |
| 28 | in de bak | 5597 | 1.46 | 295 | 0.18 | 0.13 (0.62) | -0.16 (0.58) | 0.02 (0.68) | -0.41 (0.63) |
| 42 | van het dak | 6202 | 1.50 | 280 | 0.16 | -0.40 (0.70) | -0.48 (0.61) | -0.40 (0.78) | -0.43 (0.52) |
| 7 | op het boek | 74296 | 2.58 | 274 | 0.15 | -0.59 (0.83) | -0.64 (0.67) | -0.30 (0.79) | -0.61 (0.55) |
| 39 | over de kop | 19931 | 2.01 | 251 | 0.11 | 0.66 (0.36) | 0.33 (0.49) | 0.40 (0.56) | 0.34 (0.54) |
| 75 | in de rondte | 205 | 0.02 | 195 | 0.00 | -0.02 (0.78) | -0.34 (0.70) | -0.28 (0.83) | -0.28 (0.56) |
| 38 | onder de dekens | 2585 | 1.12 | 175 | -0.05 | 0.59 (0.33) | 0.46 (0.69) | 0.50 (0.50) | 0.49 (0.59) |
| 68 | op de film | 47205 | 2.38 | 145 | -0.13 | -0.80 (0.87) | -0.78 (0.90) | -0.88 (0.89) | -0.79 (0.84) |
| 33 | tegen de vlakte | 1682 | 0.93 | 141 | -0.14 | 0.19 (0.70) | -0.05 (0.54) | 0.13 (0.60) | 0.01 (0.67) |
| 17 | op de bon | 2267 | 1.06 | 103 | -0.27 | 0.02 (0.68) | -0.19 (0.74) | 0.05 (0.77) | -0.26 (0.72) |
| 6 | onder de wol | 1068 | 0.74 | 96 | -0.30 | -0.01 (0.86) | -0.07 (0.73) | 0.05 (0.73) | 0.19 (0.80) |
| 26 | uit de kom | 803 | 0.61 | 95 | -0.31 | -0.03 (0.67) | -0.16 (0.73) | -0.33 (0.82) | -0.14 (0.58) |
| 53 | langs het water | 42001 | 2.33 | 83 | -0.37 | 0.26 (0.54) | 0.07 (0.74) | -0.18 (0.86) | 0.04 (0.63) |
| 36 | op de pof | 93 | -0.32 | 67 | -0.46 | -0.90 (0.99) | -1.13 (0.84) | -0.78 (0.92) | -0.97 (0.91) |
| 64 | over het gras | 4481 | 1.36 | 54 | -0.55 | 0.14 (0.77) | -0.20 (0.77) | 0.01 (0.71) | -0.14 (0.74) |
| 56 | tegen mama | 8035 | 1.61 | 49 | -0.59 | 0.18 (0.72) | 0.41 (0.81) | 0.37 (0.79) | 0.41 (0.88) |
| 40 | rond de middag | 7659 | 1.59 | 48 | -0.60 | 0.60 (0.48) | 0.64 (0.53) | 0.56 (0.54) | 0.66 (0.55) |
| 11 | in het geld | 59244 | 2.48 | 47 | -0.61 | -1.36 (0.93) | -1.30 (0.47) | -1.30 (0.66) | -1.17 (0.69) |
| 55 | in de koffie | 12497 | 1.81 | 46 | -0.62 | 0.03 (0.89) | 0.11 (0.91) | 0.16 (0.84) | 0.15 (0.90) |
| 31 | voor de juf | 3250 | 1.22 | 29 | -0.81 | -0.05 (0.71) | -0.47 (0.75) | -0.33 (0.76) | -0.53 (0.68) |
| 8 | onder de mat | 2443 | 1.10 | 28 | -0.83 | -0.09 (0.75) | -0.28 (0.72) | -0.26 (0.82) | -0.34 (0.89) |
| 59 | uit de kunst | 19620 | 2.00 | 28 | -0.83 | -0.23 (0.96) | -0.52 (0.89) | -0.19 (0.91) | -0.47 (0.92) |
| 35 | naar de boom | 13766 | 1.85 | 27 | -0.84 | -0.58 (0.85) | -0.55 (0.79) | -0.73 (0.70) | -0.75 (0.59) |
| 76 | in de foto | 35457 | 2.26 | 26 | -0.86 | -1.40 (0.84) | -1.19 (0.68) | -1.30 (1.06) | -1.23 (0.76) |
| 48 | naast de auto | 37927 | 2.29 | 25 | -0.88 | 0.15 (0.70) | -0.05 (0.66) | -0.10 (0.73) | -0.04 (0.79) |
| 34 | uit de gang | 17176 | 1.94 | 24 | -0.89 | -0.94 (0.84) | -0.93 (0.61) | -0.79 (0.91) | -0.99 (0.60) |
| 72 | langs de tafel | 17185 | 1.94 | 22 | -0.93 | -0.36 (0.84) | -0.66 (0.72) | -0.72 (0.95) | -0.44 (0.64) |
| 9 | onder het asfalt | 1208 | 0.79 | 13 | -1.15 | -1.40 (0.98) | -0.98 (0.67) | -1.21 (0.87) | -1.19 (0.63) |
| 67 | met de schoen | 7970 | 1.61 | 11 | -1.21 | -0.89 (0.95) | -0.88 (0.52) | -0.83 (0.80) | -0.84 (0.67) |
| 50 | om de tekeningen | 5756 | 1.47 | 9 | -1.29 | -1.48 (0.74) | -1.24 (0.60) | -1.22 (0.97) | -1.20 (0.48) |
| 69 | op de meester | 7159 | 1.56 | 8 | -1.34 | -1.51 (0.88) | -1.45 (0.54) | -1.71 (0.86) | -1.53 (0.53) |
| 78 | rond de ingang | 8174 | 1.62 | 8 | -1.34 | -0.81 (0.79) | -0.88 (0.75) | -0.71 (0.84) | -0.69 (0.68) |
| 79 | uit de trommel | 716 | 0.56 | 8 | -1.34 | -0.47 (1.00) | -0.83 (0.69) | -0.61 (0.84) | -0.80 (0.57) |
| 3 | bij de fietsen | 8807 | 1.65 | 6 | -1.45 | -0.31 (1.02) | 0.20 (0.79) | 0.30 (0.62) | -0.05 (0.94) |
| 5 | in de groente | 4882 | 1.40 | 6 | -1.45 | -1.08 (0.84) | -1.03 (0.53) | -0.83 (0.83) | -1.08 (0.65) |
| 29 | in de piano | 3534 | 1.26 | 6 | -1.45 | -1.54 (0.67) | -1.33 (0.52) | -1.52 (0.61) | -1.40 (0.42) |
| 12 | langs de auto | 37927 | 2.29 | 5 | -1.51 | -0.21 (0.81) | -0.26 (0.78) | -0.28 (0.87) | -0.17 (0.68) |
| 32 | naast het café | 7456 | 1.58 | 5 | -1.51 | 0.16 (0.69) | 0.05 (0.64) | 0.24 (0.65) | -0.07 (0.69) |
| 45 | langs het park | 9253 | 1.67 | 4 | -1.59 | -0.59 (0.83) | -0.60 (0.70) | -0.58 (0.84) | -0.36 (0.64) |
| 10 | in de shampoo | 458 | 0.37 | 3 | -1.69 | -0.95 (0.95) | -1.02 (0.66) | -0.84 (0.74) | -1.05 (0.74) |
| 18 | naast het hek | 3778 | 1.29 | 3 | -1.69 | -0.16 (0.64) | -0.10 (0.67) | -0.06 (0.67) | -0.20 (0.63) |
| 20 | langs de boeken | 8777 | 1.65 | 3 | -1.69 | -1.13 (1.00) | -1.08 (0.58) | -1.12 (0.68) | -0.97 (0.48) |
| 30 | naast de bloemen | 9294 | 1.68 | 2 | -1.81 | -0.48 (0.83) | -0.57 (0.77) | -0.43 (0.69) | -0.78 (0.60) |
| 19 | voor de schommel | 274 | 0.15 | 1 | -1.99 | -0.79 (0.82) | -0.65 (0.72) | -0.48 (0.84) | -0.84 (0.56) |
| 65 | over het karton | 603 | 0.49 | 1 | -1.99 | -1.43 (0.69) | -1.35 (0.44) | -1.43 (0.75) | -1.32 (0.58) |
Appendix 3.3
Linear mixed-effects models
We fitted linear mixed-effects models (Baayen et al. 2008), using the LMER
function from the lme4 package in R (version 3.2.3; CRAN project; R Core Team,
2015), first to the familiarity judgments and then to the Δ-scores.
In the first analysis, we investigated to what extent the familiarity judgments
can be predicted by the frequency of the specific phrase (LOGFREQPP) and the
lemma-frequency of the noun (LOGFREQN), and to what degree the factors
RATINGSCALE (0 = Likert, 1 = Magnitude Estimation) and TIME (0 = first session, 1
= second session) exert influence. The fixed effects were standardized.
Participants and items were included as random effects. We incorporated a
random intercept for items and random slopes for both items and participants to
account for between-item and between-participant variation. The model does not
contain a by-participant random intercept, because after the Z-score
transformation all participants’ scores have a mean of 0 and a standard deviation
of 1.
We started with a random intercept only model. We added fixed effects, and
all two-way interactions, one by one and assessed by means of likelihood ratio
tests whether or not they significantly contributed to explaining variance in
familiarity judgments. We started with LOGFREQPP (χ2(1) = 86.64, p < .001). After
that, we added LOGFREQN (χ2(1) = 0.03, p = .87) and the interaction term
LOGFREQPP x LOGFREQN (χ2(1) = 0.002, p = .96), which did not improve model fit.
We then proceeded with RATINGSCALE (χ2(1) = 0.0003, p = .99), which did not
improve model fit either. The interaction term RATINGSCALE x LOGFREQPP did
contribute to the fit of the model (χ2(2) = 21.79, p < .001), as did RATINGSCALE x
LOGFREQN (χ2(2) = 6.77, p < .05). There cannot be a main effect of TIME in this
analysis, since scores were converted to Z-scores for the two sessions separately
(i.e. the mean scores at Time 1 and Time 2 were 0). We did include the two-way
interactions of TIME and the other factors. None of these was found to improve
model fit (TIME x RATINGSCALE (χ2(2) = 0.00, p = .99); TIME x LOGFREQPP (χ2(1) =
0.01, p = .91); TIME x LOGFREQN (χ2(1) = 0.01, p = .91)). Finally, PRESENTATIONORDER
did not contribute to the goodness-of-fit (χ2(1) = 1.27, p = .26). Apart from the
interaction term PRESENTATIONORDER x RATINGSCALE (χ2(2) = 7.05, p = .03), none of
the interactions of PRESENTATIONORDER and the other predictors in the model was
found to improve model fit (PRESENTATIONORDER x LOGFREQPP (χ2(1) = 1.89, p =
.17); PRESENTATIONORDER x LOGFREQN (χ2(1) = 0.38, p = .54); PRESENTATIONORDER x
Time (χ2(1) = 1.27, p = .26); PRESENTATIONORDER x LOGFREQPP x RATINGSCALE (χ2(2)
= 5.41, p = .07); PRESENTATIONORDER x LOGFREQN x RATINGSCALE (χ2(2) = 0.46, p =
.80)). The model selection procedure thus resulted in a model comprising
LOGFREQPP, LOGFREQN, RATINGSCALE, RATINGSCALE x LOGFREQPP, RATINGSCALE x
LOGFREQN, and PRESENTATIONORDER x RATINGSCALE.
We then added a by-item random slope for RATINGSCALE and by-participant
random slopes for LOGFREQPP and LOGFREQN. There are no by-item random slopes
for the factors LOGFREQPP, LOGFREQN, PRESENTATIONORDER, and the interactions
involving these factors, because each item has only one phrase frequency, one
lemma frequency, and a fixed position in the order of presentation. There is no by-participant random slope for RATINGSCALE, since half of the participants only used
one scale. Within these limits, a model with a full random effect structure was
constructed following Barr et al. (2013). Subsequently, we excluded random
slopes with the lowest variance step by step until a further reduction would imply
a significant loss in the goodness of fit of the model (Matuschek et al. 2017).
Model comparisons indicated that the inclusion of the by-participant random
slopes for LOGFREQPP, LOGFREQN, and PRESENTATIONORDER, and the by-item random
slope for RATINGSCALE was justified by the data (χ2(3) = 90.21, p < .001).
Inspection of the variance inflation factors revealed that there do not appear to be
harmful effects of collinearity (the highest VIF value is 1.20; tolerance statistics
are 0.83 or more, cf. Field et al. 2012: 275). Confidence intervals were estimated
via parametric bootstrapping over 1000 iterations (Bates et al. 2015). The model
is summarized in Table 3.2.
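The stepwise comparisons above all rely on the same likelihood-ratio logic: twice the difference in log-likelihood between nested models is referred to a χ² distribution. The sketch below illustrates this in Python rather than the R/lme4 workflow the appendix describes; the log-likelihood values are invented, chosen only so that the resulting χ² statistics match two of the reported ones (χ²(1) = 86.64 and χ²(1) = 0.03).

```python
import math

def chi2_sf(x, df):
    # Survival function of the chi-squared distribution, implemented only
    # for df = 1 and df = 2 (one added predictor, or a predictor plus an
    # interaction), which is all this sketch needs.
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("df must be 1 or 2 in this sketch")

def likelihood_ratio_test(loglik_reduced, loglik_full, df):
    # Under H0, 2 * (logLik_full - logLik_reduced) ~ chi-squared(df).
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df)

# Invented log-likelihoods for nested models (not the dissertation's data).
stat1, p1 = likelihood_ratio_test(-5000.0, -4956.68, df=1)    # keep predictor
stat2, p2 = likelihood_ratio_test(-4956.68, -4956.665, df=1)  # drop predictor
print(f"chi2(1) = {stat1:.2f}, p = {p1:.3g} -> keep")
print(f"chi2(1) = {stat2:.2f}, p = {p2:.2f} -> drop")
```

For df = 1 the survival function reduces to erfc(√(x/2)) and for df = 2 to e^(−x/2), which is why the sketch needs no statistics library.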
In a separate analysis, we ran linear mixed-effects models on the Δ-scores, to
determine which factors influence variation across time. The absolute Δ-scores
indicate the extent to which a participant’s rating for a particular item at Time 2
differs from the rating at Time 1 (see Section 3.3.5). For each item, we have a list
of 91 Δ-scores that express each participant’s stability in the grading. In order to
fit a linear mixed-effects model on the set of Δ-scores, we log-transformed them
using the natural logarithm function. The absolute Δ-scores constitute the positive
half of a normal distribution. Log-transforming the scores yields a normal
distribution, thus complying with the assumptions of parametric statistical tests.
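Concretely, the transformation can be sketched as follows (toy numbers, not the dissertation's data; the small offset that avoids log(0) for identical ratings is our own addition, as the text only mentions the log transform itself):

```python
import math

# Toy Z-scored familiarity judgments for five items at Time 1 and Time 2.
time1 = [0.94, 0.81, -0.36, 0.60, -1.40]
time2 = [1.02, 0.95, -0.22, 0.58, -1.30]

# Absolute Δ-score per item: the size of the shift between the two sessions.
abs_delta = [abs(t2 - t1) for t1, t2 in zip(time1, time2)]

# Natural-log transform to pull the right-skewed absolute scores toward
# normality; EPS guards against log(0) when a rating is unchanged.
EPS = 1e-3
log_abs_delta = [math.log(d + EPS) for d in abs_delta]
print([round(d, 2) for d in abs_delta])
```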
LOGFREQPP, LOGFREQN, RATINGSCALET1 and RATINGSCALET2 (the type of scale
used at Time 1 and Time 2 respectively, i.e. Likert or ME), and PRESENTATIONORDER
were included as fixed effects and standardized. Participants and items were
included as random effects. We incorporated a random intercept for both items
and participants to account for between-item and between-participant variation.
We then added fixed effects one by one and assessed by means of likelihood ratio
tests whether or not they significantly contributed to explaining variance in log-transformed absolute Δ-scores. We started with LOGFREQPP (χ2(1) = 32.92, p <
.001). After that, we added LOGFREQN (χ2(1) = 0.04, p = .84). Given that LOGFREQN
did not improve model fit, we left out this predictor. We then proceeded with
RATINGSCALET1 (χ2(1) = 0.15, p = .70) and RATINGSCALET2 (χ2(1) = 2.39, p = .12),
neither of which improved model fit. The interaction term RATINGSCALET1 x
RATINGSCALET2 did not contribute to the fit of the model either (χ2(3) = 6.67, p
= .08). The interaction term RATINGSCALET1 x LOGFREQPP did improve model fit
(χ2(2) = 40.94, p < .001), as did RATINGSCALET2 x LOGFREQPP (χ2(2) = 13.91, p <
.001). The three-way interaction RATINGSCALET1 x RATINGSCALET2 x LOGFREQPP did
not explain a significant portion of variance (χ2(2) = 4.63, p = .10). Finally, neither
PRESENTATIONORDER (χ2(1) = 0.27, p = .60), nor any of the interactions of
PRESENTATIONORDER and the other predictors in the model was found to improve
model fit (PRESENTATIONORDER x LOGFREQPP (χ2(1) = 1.75, p = .19);
PRESENTATIONORDER x LOGFREQPP x RATINGSCALET1 (χ2(2) = 2.52, p = .28);
PRESENTATIONORDER x LOGFREQPP x RATINGSCALET2 (χ2(2) = 1.78, p = .41)). The
model selection procedure thus resulted in a model comprising LOGFREQPP,
RATINGSCALET1 x LOGFREQPP, and RATINGSCALET2 x LOGFREQPP.
We then added by-item random slopes for RATINGSCALET1 and RATINGSCALET2,
and a by-participant random slope for LOGFREQPP, thus constructing a model with
a full random effect structure following Barr et al. (2013). Subsequently, we
excluded random slopes with the lowest variance step by step until a further
reduction would imply a significant loss in the goodness of fit of the model
(Matuschek et al. 2017). Model comparisons indicated that the inclusion of the
by-item random slope for RATINGSCALET1 and the by-participant random slopes for
LOGFREQPP was justified by the data (χ2(2) = 12.96, p < .01). Inspection of the
variance inflation factors revealed that there do not appear to be harmful effects
of collinearity (the highest VIF value is 2.76; tolerance statistics are 0.36 or more).
Again, confidence intervals were estimated via parametric bootstrapping over
1000 iterations. The model is summarized in Table 3.5.
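The parametric bootstrap used for these confidence intervals (bootMer in lme4; Bates et al. 2015) simulates new responses from the fitted model, refits, and takes percentiles of the refitted estimates. The deliberately simplified sketch below applies that recipe to a plain least-squares slope rather than a mixed model; the data and model are our own toy example, not the dissertation's.

```python
import random
import statistics

random.seed(1)

def fit_ols(xs, ys):
    # Ordinary least squares: returns (intercept, slope). This plays the
    # role of the "refit" step of the parametric bootstrap.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

# Toy data: true slope 0.5, Gaussian noise with SD 0.3.
xs = [i / 10 for i in range(50)]
ys = [0.5 * x + random.gauss(0.0, 0.3) for x in xs]

a, b = fit_ols(xs, ys)
resid_sd = statistics.stdev(y - (a + b * x) for x, y in zip(xs, ys))

# Parametric bootstrap: simulate responses from the *fitted* model,
# refit, and collect the refitted slopes.
boot_slopes = []
for _ in range(1000):
    sim_ys = [a + b * x + random.gauss(0.0, resid_sd) for x in xs]
    boot_slopes.append(fit_ols(xs, sim_ys)[1])

boot_slopes.sort()
ci_low, ci_high = boot_slopes[24], boot_slopes[975]  # ~95% percentile CI
print(f"slope = {b:.3f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```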
Appendix 4.1
Job ad word sequences and corpus-based frequencies and surprisal estimates
The Job ad word sequences; base-10 logarithm of the frequency of occurrence per million words in the Job ad corpus and the NLCOW14-subset for the phrase as a whole and for the final word (lemma search); the surprisal of the final word based on data in NLCOW14-subset.
| # | Word sequence | LogFreq. phrase (Job ad corpus) | LogFreq. phrase (NLCOW14-subset) | Surprisal final word (NLCOW14-subset) | LogFreq. final word (NLCOW14-subset) |
| 1 | 40 uur per week | 2.52 | -0.40 | 41 | 0.92 |
| 2 | voor meer informatie | 2.36 | 0.37 | 84 | 1.33 |
| 3 | kennis en ervaring | 2.10 | 0.12 | 110 | 1.07 |
| 4 | hoog in het vaandel | 1.84 | 0.34 | 24 | -0.32 |
| 5 | werving en selectie | 1.82 | -0.54 | 119 | 0.63 |
| 6 | een vast dienstverband | 1.87 | -0.77 | 332 | -0.09 |
| 7 | voor langere tijd | 1.65 | 0.08 | 91 | 1.33 |
| 8 | het eerste aanspreekpunt | 1.48 | -0.63 | 397 | -0.15 |
| 9 | goede contactuele eigenschappen | 1.39 | -1.22 | 339 | 0.82 |
| 10 | bij gebleken geschiktheid | 1.32 | -1.06 | 217 | -0.24 |
| 11 | academisch werk- en denkniveau | 1.00 | -1.33 | 29 | -0.85 |
| 12 | een grote mate van zelfstandigheid | 1.15 | -0.99 | 46 | 0.07 |
| 13 | in een hecht team | 0.82 | -1.57 | 119 | 1.08 |
| 14 | een persoonlijk ontwikkelingsplan | 0.55 | -1.27 | 537 | -0.71 |
| 15 | een sterk analytisch vermogen | 0.67 | -1.69 | 208 | 0.89 |
| 16 | met de mogelijkheid tot verlenging | 0.50 | -1.69 | 68 | 0.17 |
| 17 | in de breedste zin van het woord | 0.94 | -0.04 | 9 | 0.96 |
| 18 | met een afstand tot de arbeidsmarkt | 0.05 | -1.06 | 20 | 0.17 |
| 19 | het geschetste profiel | 0.24 | -1.87 | 1546 | 0.58 |
| 20 | in de meest uiteenlopende sectoren | 0.39 | -2.17 | 135 | 0.34 |
| 21 | een vliegende start | 0.10 | -0.49 | 226 | 1.23 |
| 22 | bewijs van goed gedrag | 0.11 | -0.99 | 71 | 1.17 |
| 23 | conform de geldende CAO | -0.08 | -1.87 | 151 | 0.20 |
| 24 | met behoud van uitkering | -0.02 | -0.51 | 45 | 0.47 |
| 25 | bevoegd en bekwaam | -0.08 | -1.39 | 247 | 0.04 |
| 26 | een integrale benadering | -0.17 | -0.81 | 342 | 0.83 |
| 27 | naar aanleiding van de advertentie | -0.56 | -2.17 | 96 | 0.28 |
| 28 | eenvoudige administratieve werkzaamheden | -0.51 | -1.87 | 919 | 0.82 |
| 29 | een scherpe blik | -0.52 | -1.17 | 447 | 0.86 |
| 30 | buiten de geijkte paden | -0.90 | -1.57 | 110 | 0.37 |
| 31 | affiniteit met het onderwerp | -0.74 | -1.69 | 112 | 0.84 |
| 32 | een internationale speler van formaat | -1.24 | -2.17 | 519 | 0.55 |
| 33 | een flinke portie lef | -1.39 | -2.17 | 344 | 0.07 |
| 34 | met bewezen kwaliteiten | -1.17 | -2.17 | 1586 | 0.59 |
| 35 | een collegiale opstelling | -1.29 | -2.17 | 13960 | 0.55 |
Appendix 4.2
News report word sequences and corpus-based frequencies and surprisal estimates
The News report word sequences; base-10 logarithm of the frequency of occurrence per million words in the Twente News Corpus and the
NLCOW14-subset for the phrase as a whole and for the final word (lemma search); the surprisal of the final word based on data in NLCOW14-subset.
| # | Word sequence | LogFreq. phrase (News report corpus) | LogFreq. phrase (NLCOW14-subset) | Surprisal final word (NLCOW14-subset) | LogFreq. final word (NLCOW14-subset) |
| 36 | de Tweede Kamer | 1.94 | 0.38 | 144 | 0.31 |
| 37 | wetenschap en techniek | 1.87 | -0.67 | 211 | 0.81 |
| 38 | verkeer en vervoer | 1.80 | -0.52 | 169 | 0.57 |
| 39 | in elk geval | 1.71 | 0.84 | 52 | 0.65 |
| 40 | in de Verenigde Staten | 1.66 | 0.82 | 27 | 0.20 |
| 41 | het openbaar ministerie | 1.16 | -0.32 | 264 | 0.22 |
| 42 | de negentiende eeuw | 1.05 | -0.58 | 269 | 0.42 |
| 43 | de raad van bestuur | 1.04 | -2.17 | 101 | 0.77 |
| 44 | aan de andere kant | 1.22 | 0.81 | 28 | 1.10 |
| 45 | evenementen en manifestaties | 1.47 | -2.17 | 4662 | 0.00 |
| 46 | het dagelijks leven | 0.97 | -0.20 | 213 | 1.50 |
| 47 | op een gegeven moment | 0.98 | 0.54 | 32 | 0.85 |
| 48 | met terugwerkende kracht | 0.58 | 0.26 | 77 | 1.03 |
| 49 | in volle gang | 0.66 | 0.05 | 96 | 0.60 |
| 50 | een doorn in het oog | 0.55 | 0.01 | 19 | 0.76 |
| 51 | op geen enkele wijze | 0.19 | 0.15 | 30 | 1.23 |
| 52 | aan het begin van het seizoen | 0.00 | -0.38 | 15 | 0.74 |
| 53 | de lokale bevolking | 0.30 | -0.25 | 329 | 0.78 |
| 54 | het centrum van de stad | 0.42 | -0.59 | 50 | 1.02 |
| 55 | correcties en aanvullingen | 0.32 | -1.17 | 189 | 0.06 |
| 56 | de opvang van asielzoekers | -0.05 | -1.09 | 149 | 0.32 |
| 57 | de traditionele partijen | -0.42 | -1.06 | 617 | 0.97 |
| 58 | op last van de rechter | -0.35 | -2.17 | 27 | 0.60 |
| 59 | in de huidige situatie | -0.20 | -0.22 | 56 | 0.99 |
| 60 | een onafhankelijke commissie | -0.09 | -0.83 | 382 | 0.65 |
| 61 | een criminele afrekening | -0.83 | -1.87 | 1486 | -0.13 |
| 62 | de koninklijke loge | -0.90 | -1.87 | 1318 | -0.54 |
| 63 | een ingrijpende herstructurering | -0.90 | -1.69 | 895 | -0.01 |
| 64 | op weg naar de top | -0.71 | -1.22 | 32 | 0.93 |
| 65 | in het belang van het kind | -0.63 | -0.57 | 16 | 1.01 |
| 66 | aan de vooravond van een revolutie | -1.36 | -1.87 | 46 | 0.40 |
| 67 | de uitkomsten van het rapport | -1.30 | -1.69 | 73 | 0.78 |
| 68 | met hernieuwde energie | -1.46 | -1.17 | 262 | 1.03 |
| 69 | een ongekende vrijheid | -1.38 | -2.17 | 886 | 0.79 |
| 70 | een luxe jacht | -1.46 | -2.17 | 1092 | 0.35 |
Appendix 4.3
Average Stereotypy Scores for the Job ad stimuli

| # | Cue | Recruiters M (SD) | Job-seekers M (SD) | Inexperienced M (SD) |
| 1 | 40 uur per | 97.5 (15.8) | 97.5 (15.8) | 90.5 (29.7) |
| 2 | voor meer | 58.2 (48.1) | 58.3 (48.1) | 55.4 (48.6) |
| 3 | kennis en | 21.0 (29.2) | 12.9 (23.7) | 6.3 (19.1) |
| 4 | hoog in het | 90.0 (30.4) | 82.5 (38.5) | 66.7 (47.7) |
| 5 | werving en | 96.7 (0.0) | 84.6 (32.4) | 27.6 (44.2) |
| 6 | een vast | 15.8 (19.8) | 13.7 (22.0) | 2.4 (13.0) |
| 7 | voor langere | 47.9 (43.7) | 64.9 (38.0) | 0.2 (0.6) |
| 8 | het eerste | 5.2 (8.3) | 63.3 (40.4) | 0.0 (0.1) |
| 9 | goede contactuele | 57.5 (50.1) | 52.5 (50.6) | 2.4 (15.4) |
| 10 | bij gebleken | 74.8 (43.7) | 29.9 (46.3) | 2.4 (15.4) |
| 11 | academisch werk- en | 85.0 (36.2) | 57.5 (50.1) | 0.0 (0.0) |
| 12 | een grote mate van | 25.4 (32.2) | 11.8 (25.5) | 3.2 (14.3) |
| 13 | in een hecht | 52.5 (50.6) | 40.0 (49.6) | 11.9 (32.8) |
| 14 | een persoonlijk | 17.3 (23.4) | 13.7 (21.9) | 13.1 (21.5) |
| 15 | een sterk analytisch | 95.0 (22.1) | 80.0 (40.5) | 66.7 (47.7) |
| 16 | met de mogelijkheid tot | 33.9 (46.0) | 22.0 (40.3) | 1.0 (1.9) |
| 17 | in de breedste zin van het | 100.0 (0.0) | 95.0 (22.1) | 78.6 (41.5) |
| 18 | met een afstand tot de | 55.0 (50.4) | 2.5 (15.8) | 0.0 (0.0) |
| 19 | het geschetste | 35.0 (48.3) | 17.5 (38.5) | 0.0 (0.0) |
| 20 | in de meest uiteenlopende | 0.2 (0.4) | 0.2 (0.4) | 0.1 (0.3) |
| 21 | een vliegende | 77.8 (38.3) | 70.5 (43.2) | 76.2 (43.1) |
| 22 | bewijs van goed | 97.5 (15.8) | 100.0 (0.0) | 4.1 (2.5) |
| 23 | conform de geldende | 13.9 (31.8) | 13.9 (34.2) | 5.3 (15.6) |
| 24 | met behoud van | 25.8 (32.0) | 9.7 (23.3) | 0.0 (0.0) |
| 25 | bevoegd en | 22.5 (42.3) | 17.5 (38.5) | 7.1 (26.1) |
| 26 | een integrale | 12.9 (20.5) | 13.4 (21.1) | 0.2 (1.1) |
| 27 | naar aanleiding van de | 9.4 (22.6) | 6.5 (19.0) | 0.0 (0.0) |
| 28 | eenvoudige administratieve | 50.0 (50.6) | 50.0 (50.6) | 28.6 (45.7) |
| 29 | een scherpe | 9.0 (9.7) | 9.4 (8.5) | 5.1 (8.2) |
| 30 | buiten de geijkte | 56.1 (36.4) | 48.6 (40.8) | 4.3 (17.8) |
| 31 | affiniteit met het | 13.4 (17.7) | 10.3 (17.5) | 8.8 (15.4) |
| 32 | een internationale speler van | 2.5 (15.8) | 7.5 (26.7) | 0.0 (0.0) |
| 33 | een flinke portie | 0.0 (0.0) | 0.9 (5.6) | 0.8 (5.4) |
| 34 | met bewezen | 2.5 (15.8) | 2.5 (15.8) | 2.4 (15.4) |
| 35 | een collegiale | 14.1 (33.6) | 11.8 (31.2) | 0.0 (0.0) |
Appendix 4.4
Average Stereotypy Scores for the News report stimuli

| # | Cue | Recruiters M (SD) | Job-seekers M (SD) | Inexperienced M (SD) |
| 36 | de Tweede | 34.9 (18.2) | 39.4 (16.9) | 32.6 (17.2) |
| 37 | wetenschap en | 5.3 (20.5) | 8.0 (23.0) | 10.6 (28.3) |
| 38 | verkeer en | 7.5 (21.0) | 15.9 (25.5) | 2.7 (7.5) |
| 39 | in elk | 73.1 (42.7) | 87.7 (29.6) | 65.0 (46.4) |
| 40 | in de Verenigde | 86.5 (32.9) | 96.3 (15.5) | 94.1 (21.3) |
| 41 | het openbaar | 21.9 (36.2) | 22.3 (36.6) | 14.7 (31.2) |
| 42 | de negentiende | 72.8 (42.6) | 63.1 (46.9) | 50.8 (49.0) |
| 43 | de raad van | 30.7 (13.8) | 27.5 (18.5) | 24.0 (18.9) |
| 44 | aan de andere | 98.0 (0.6) | 95.7 (15.5) | 98.3 (0.9) |
| 45 | evenementen en | 0.0 (0.0) | 0.0 (0.0) | 0.0 (0.0) |
| 46 | het dagelijks | 34.4 (29.9) | 46.1 (25.9) | 38.5 (31.8) |
| 47 | op een gegeven | 100.0 (0.0) | 95.0 (22.1) | 100.0 (0.0) |
| 48 | met terugwerkende | 97.5 (15.8) | 97.5 (15.8) | 92.9 (26.1) |
| 49 | in volle | 11.4 (21.1) | 10.0 (18.5) | 7.9 (14.2) |
| 50 | een doorn in het | 95.0 (22.1) | 97.5 (15.8) | 90.5 (29.7) |
| 51 | op geen enkele | 53.7 (31.4) | 61.9 (29.2) | 60.7 (32.7) |
| 52 | aan het begin van het | 3.6 (12.5) | 5.7 (13.5) | 7.5 (19.2) |
| 53 | de lokale | 7.0 (12.5) | 5.5 (10.3) | 7.8 (12.0) |
| 54 | het centrum van de | 54.7 (47.6) | 45.2 (48.1) | 68.0 (43.5) |
| 55 | correcties en | 22.5 (42.3) | 25.0 (43.9) | 7.1 (26.1) |
| 56 | de opvang van | 8.3 (24.3) | 4.3 (17.7) | 6.6 (20.8) |
| 57 | de traditionele | 3.6 (7.6) | 4.1 (6.9) | 1.2 (3.8) |
| 58 | op last van de | 0.0 (0.0) | 7.5 (26.7) | 9.5 (29.7) |
| 59 | in de huidige | 19.2 (17.1) | 12.8 (16.1) | 16.5 (16.4) |
| 60 | een onafhankelijke | 1.4 (8.6) | 7.0 (12.1) | 0.0 (0.0) |
| 61 | een criminele | 28.8 (44.5) | 26.4 (43.3) | 13.7 (33.9) |
| 62 | de koninklijke | 13.6 (13.2) | 11.6 (12.9) | 17.8 (13.9) |
| 63 | een ingrijpende | 3.4 (5.7) | 1.3 (8.5) | 6.3 (8.9) |
| 64 | op weg naar de | 4.8 (4.2) | 3.0 (12.0) | 0.0 (0.0) |
| 65 | in het belang van het | 3.9 (13.9) | 1.9 (10.9) | 14.4 (13.8) |
| 66 | aan de vooravond van een | 11.7 (24.7) | 5.5 (4.2) | 0.0 (0.0) |
| 67 | de uitkomsten van het | 63.6 (44.3) | 77.4 (36.1) | 62.5 (44.7) |
| 68 | met hernieuwde | 12.5 (33.5) | 10.0 (30.4) | 2.4 (15.4) |
| 69 | een ongekende | 2.2 (9.7) | 1.6 (10.6) | 0.0 (0.0) |
| 70 | een luxe | 8.3 (8.8) | 5.2 (10.9) | 12.5 (18.1) |
Appendix 4.5
Mixed-effects logistic regression model fitted to the completion
task data
The stereotypy scores were not normally distributed. Therefore, it was not justified
to fit a linear mixed-effects model. We used a mixed-effects logistic regression
model (Jaeger 2008) instead. Per response, we indicated whether or not it
corresponded to a complement observed in the specialized corpora. By means of
a mixed logit-model, we investigated whether there are significant differences
across groups of participants and/or sets of stimuli in the proportion of responses
that correspond to a complement in the specialized corpora. We fitted this model
using the LMER function from the lme4 package in R (version 3.3.3; CRAN project;
R Core Team, 2017). GROUP, ITEMTYPE, and their interaction were included as fixed
effects, and participants and items as random effects. The fixed effects were
standardized. Random intercepts and random slopes for participants and items
were included to account for between-subject and between-item variation.33
A model with a full random effect structure was constructed following Barr,
Levy, Scheepers, and Tily (2013). A comparison with the intercept-only model
proved that the inclusion of the by-item random slope for GROUP and the by-participant random slope for ITEMTYPE was justified by the data (χ2(7) = 174.83, p
< .001). Confidence intervals were estimated via parametric bootstrapping over
1000 iterations (Bates, Mächler, Bolker & Walker 2015).
In order to obtain all relevant comparisons of the three groups and the two
types of stimuli, we ran the model with different coding schemes and we report
99% confidence intervals (as opposed to the more common 95%) to correct for
multiple comparisons. Since the groups were not expected to differ systematically
in experience with News report word sequences, none of the groups forms a
natural baseline in this respect. As for the Job ad stimuli, from a usage-based
perspective, differences between Recruiters and Job-seekers are as interesting as
differences between Job-seekers and Inexperienced participants, or Recruiters
and Inexperienced participants. Therefore, we treatment-coded the factors, first
using Recruiters as the reference group for GROUP and Job ad stimuli as the
reference group for ITEMTYPE. The resulting model is summarized in Table 4.6. The
intercept represents the proportion of the Recruiters’ responses to the Job ad
stimuli that correspond to a complement in the Job ad corpus. This proportion
does not differ significantly from the proportion of their responses to the News
report items that correspond to a complement in the Twente News Corpus.
33 By-participant random slopes for GROUP were not included, as this was a between-participants factor; by-item random slopes for ITEMTYPE were not included, as this was a between-items factor.
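Refitting the model under different reference conditions, as described above, amounts to releveling the treatment-coded factors before refitting; a minimal sketch with illustrative factor levels:

```r
# Reference condition Recruiters–Job ad (Table 4.6):
completion$Group    <- relevel(factor(completion$Group), ref = "Recruiters")
completion$ItemType <- relevel(factor(completion$ItemType), ref = "JobAd")
# refit the model, then switch the reference group for Table 4.7:
completion$Group <- relevel(completion$Group, ref = "Jobseekers")
```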
Table 4.6 Mixed-effects logistic regression model (family: binomial) fitted to the responses to the completion task (0 = does not correspond to a complement in the specialized corpus; 1 = corresponds to a complement in the specialized corpus), using Recruiters–Job ad stimuli as the reference condition.

                                            Estimate    SE      z      99% CI
(Intercept)                                    0.56    0.43    1.31   -0.54, 1.65
Itemtype_NewsReport                           -0.56    0.60   -0.93   -2.06, 0.97
Group_Jobseekers                              -0.69    0.17   -4.09   -1.11, -0.26   **
Group_Inexperienced                           -2.38    0.29   -8.30   -3.11, -1.64   **
Itemtype_NewsReport x Group_Jobseekers         0.91    0.21    4.36    0.36, 1.46    **
Itemtype_NewsReport x Group_Inexperienced      2.14    0.38    5.62    1.15, 3.09    **
Note: Significance code: 0.01 '**'

There are significant differences between the groups of participants on the Job ad stimuli. Both the Inexperienced participants and the Job-seekers have significantly lower proportions of responses to the Job ad stimuli that match a complement in the Job ad corpus than the Recruiters. The model also reveals that the difference between the proportions on the two types of stimuli is significantly different across groups.
Table 4.7 Mixed-effects logistic regression model (family: binomial) fitted to the responses to the completion task (0 = does not correspond to a complement in the specialized corpus; 1 = corresponds to a complement in the specialized corpus), using Job-seekers–Job ad stimuli as the reference condition.

                                            Estimate    SE      z      99% CI
(Intercept)                                   -0.13    0.40   -0.32   -1.13, 0.86
Itemtype_NewsReport                            0.35    0.56    0.63   -1.04, 1.72
Group_Recruiters                               0.69    0.17    4.09    0.25, 1.14    **
Group_Inexperienced                           -1.69    0.25   -6.78   -2.34, -1.04   **
Itemtype_NewsReport x Group_Recruiters        -0.91    0.21   -4.36   -1.43, -0.39   **
Itemtype_NewsReport x Group_Inexperienced      1.23    0.32    3.79    0.38, 2.07    **
Note: Significance code: 0.01 '**'
To examine the remaining differences, we then used Job-seekers–Job ad stimuli as the reference condition. The outcomes are summarized in Table 4.7. The proportion of the Job-seekers' responses to the Job ad items that correspond to a complement in the Job ad corpus does not differ significantly from the proportion of their responses to the News report items that match a complement in the Twente News Corpus. Furthermore, the outcomes show that the Job-seekers' responses to the Job ad stimuli were significantly more likely to correspond to a complement in the Job ad corpus than the responses of the Inexperienced participants. In addition, the model reveals that the difference between the proportions on the two types of stimuli is significantly different for the Inexperienced participants compared to the Job-seekers.
Table 4.8 Mixed-effects logistic regression model (family: binomial) fitted to the responses to the completion task (0 = does not correspond to a complement in the specialized corpus; 1 = corresponds to a complement in the specialized corpus), using Inexperienced–News report stimuli as the reference condition.

                                      Estimate    SE      z      99% CI
(Intercept)                             -0.24    0.45   -0.62   -1.34, 0.84
Itemtype_JobAd                          -1.58    0.64   -2.47   -3.12, 0.01
Group_Jobseekers                         0.46    0.23    1.98   -0.11, 1.04
Group_Recruiters                         0.24    0.27    0.88   -0.44, 0.92
Itemtype_JobAd x Group_Jobseekers        1.23    0.32    3.79    0.38, 2.04    **
Itemtype_JobAd x Group_Recruiters        2.14    0.38    5.61    1.14, 3.11    **
Note: Significance code: 0.01 '**'

Finally, we used Inexperienced–News report stimuli as the reference condition. The outcomes, summarized in Table 4.8, show that the proportion of the Inexperienced participants' responses to the Job ad items that correspond to a complement in the specialized corpus is not significantly different from the proportion of their responses to the News report items that match a complement in the specialized corpus. They also reveal that the three groups do not differ significantly from each other in the proportion of responses to the News report stimuli that match a complement in the specialized corpus.
Appendix 4.6
Linear mixed-effects models fitted to the voice onset times (VOT task)
We fitted linear mixed-effects models (Baayen et al. 2008), using the LMER
function from the lme4 package in R (version 3.3.2; CRAN project; R Core Team
2017), to the Voice Onset Times. First, we investigated whether there are
significant differences in VOTs across groups of participants and/or sets of
stimuli, similar to our analysis of the stereotypy scores. Subsequently, we
examined to what extent the VOTs can be predicted by word length, corpus-based
word frequency, presentation order, and different measures of word predictability.
In the first analysis, GROUP, ITEMTYPE, and their interaction were included as
fixed effects, and participants and items as random effects. The fixed effects were
standardized. We included random intercepts and slopes for participants and
items to account for between-subject and between-item variation.34
A model with a full random effect structure was constructed following Barr et
al. (2013). A comparison with the intercept-only model showed that the inclusion
of the by-item random slope for GROUP and the by-participant random slope for
ITEMTYPE was justified by the data (χ2(7) = 34.34, p < .001). The variance explained
by this model is 59% (R2m = .04, R2c = .59).35 Confidence intervals were estimated
via parametric bootstrapping over 1000 iterations (Bates et al. 2015).
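The marginal and conditional R² values reported here can be obtained with the method of Johnson (2014), as implemented, for example, in the MuMIn package; a sketch under the same illustrative naming as above:

```r
library(lme4)
library(MuMIn)  # provides r.squaredGLMM()

vot_model <- lmer(VOT ~ Group * ItemType +
                    (1 + ItemType | Participant) +
                    (1 + Group | Item),
                  data = vot_data)
r.squaredGLMM(vot_model)  # R2m (fixed effects) and R2c (full model)
```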
In order to obtain all relevant comparisons of the three groups and the two
types of stimuli, we ran the model with different coding schemes and we report
99% confidence intervals to correct for multiple comparisons. We treatment-coded the factors, first using Recruiters as the reference group for GROUP and Job
ad stimuli as the reference group for ITEMTYPE. The resulting model is summarized
in Table 4.9. The intercept represents the mean VOT of the Recruiters on the Job
ad stimuli. Subsequently, we used Job-seekers–Job ad stimuli as the reference
condition (Table 4.10), and finally Inexperienced-News report stimuli (Table 4.11).
The models reveal that none of the groups shows a significant difference
between VOTs on the News report items and VOTs on the Job ad items. The
Inexperienced do differ significantly from the Recruiters and Job-seekers in the
relationship between the two sets of items. The majority of the Recruiters and the
Job-seekers responded faster to the Job ad items than to the News report items
(as evidenced by the Recruiters' and Job-seekers' marks below the zero line in Figure 4.4). For the vast majority of the Inexperienced participants it is just the other way around: they were faster on the News report stimuli compared to the Job ad stimuli. The mixed-effects models indicate that the Inexperienced participants' data pattern is significantly different from the Recruiters' and the Job-seekers'.

34 By-participant random slopes for GROUP were not included, as this was a between-participants factor; by-item random slopes for ITEMTYPE were not included, as this was a between-items factor.
35 R2m (marginal R² coefficient) represents the amount of variance explained by the fixed effects; R2c (conditional R² coefficient) is interpreted as variance explained by both fixed and random effects (i.e. the full model) (Johnson 2014).

Table 4.9 Generalized linear mixed-effects model (family: Gaussian) fitted to the voice onset times, using Recruiters–Job ad stimuli as the reference condition.

                                            Estimate     SE       t       99% CI
(Intercept)                                    0.522    0.017    30.27    0.477, 0.566
Itemtype_NewsReport                            0.020    0.016     1.24   -0.024, 0.064
Group_Jobseekers                               0.009    0.019     0.50   -0.036, 0.057
Group_Inexperienced                           -0.036    0.019    -1.93   -0.085, 0.013
Itemtype_NewsReport x Group_Jobseekers        -0.011    0.006    -1.88   -0.026, 0.004
Itemtype_NewsReport x Group_Inexperienced     -0.030    0.007    -4.12   -0.048, -0.011   **
Note: Significance code: 0.01 '**'
Table 4.10 Generalized linear mixed-effects model (family: Gaussian) fitted to the voice onset times, using Job-seekers–Job ad stimuli as the reference condition.

                                            Estimate     SE       t       99% CI
(Intercept)                                    0.531    0.017    32.09    0.488, 0.574
Itemtype_NewsReport                            0.009    0.015     0.62   -0.028, 0.047
Group_Recruiters                              -0.009    0.019    -0.50   -0.058, 0.040
Group_Inexperienced                           -0.045    0.018    -2.47   -0.094, 0.003
Itemtype_NewsReport x Group_Recruiters         0.011    0.006     1.88   -0.004, 0.026
Itemtype_NewsReport x Group_Inexperienced     -0.019    0.005    -3.43   -0.034, -0.004   **
Note: Significance code: 0.01 '**'

Table 4.11 Generalized linear mixed-effects model (family: Gaussian) fitted to the voice onset times, using Inexperienced–News report stimuli as the reference condition.

                                      Estimate     SE       t       99% CI
(Intercept)                              0.476    0.017    28.70    0.434, 0.520
Itemtype_JobAd                           0.010    0.016     0.61   -0.031, 0.048
Group_Jobseekers                         0.064    0.018     3.53    0.016, 0.115   **
Group_Recruiters                         0.066    0.018     3.56    0.017, 0.111   **
Itemtype_JobAd x Group_Recruiters       -0.030    0.007    -4.12   -0.048, -0.011  **
Itemtype_JobAd x Group_Jobseekers       -0.019    0.005    -3.43   -0.033, -0.005  **
Note: Significance code: 0.01 '**'
In the second analysis, we investigated to what extent the VOTs can be predicted
by various characteristics of the target words. We included the length of the target
word in letters (WORDLENGTH), and its lemma-frequency, residualized against word
length (rLOGFREQ), as they are known to affect naming times. In addition, we
examined possible effects of PRESENTATIONORDER and BLOCK, as artifacts of our
experimental design. Furthermore, we investigated three different
operationalizations of word predictability. GENERICSURPRISAL is the surprisal of the
target word given the cue, estimated by language models trained on the generic
corpus meant to reflect Dutch readers’ overall experience. CLOZEPROBABILITY
amounts to the percentage of participants that complemented the cue with the
target word in the completion task preceding the VOT task. The binary variable
TARGETMENTIONED indicates whether or not the target word had been mentioned
by a given participant in the completion task. The fixed effects were standardized.
Participants and items were included as random effects. We incorporated a
random intercept for both items and participants to account for between-item and
between-participant variation. We then added fixed effects one by one and
assessed by means of likelihood ratio tests whether or not they significantly
contributed to explaining variance in voice onset times.
We started with WORDLENGTH (χ2(1) = 13.73, p < .001), followed by rLOGFREQ (χ2(1) = 4.78, p < .05), and PRESENTATIONORDER (χ2(1) = 3.97, p < .05). After that,
we added BLOCK (χ2(1) = 2.10, p = .15) and the interaction term PRESENTATIONORDER
x BLOCK (χ2(1) = 0.01, p = .93). Given that neither of the latter two improved model
fit, we left out these predictors. We then proceeded with the predictability
measures, starting with the most general one: GENERICSURPRISAL. This predictor
did not contribute to the fit of the model (χ2(1) = 2.54, p = .11) and therefore we
omitted it. CLOZEPROBABILITY did improve model fit (χ2(1) = 49.22, p < .001), as did
TARGETMENTIONED (χ2(1) = 309.37, p < .001). We then included the interaction term
rLOGFREQ x CLOZEPROBABILITY, which did not contribute to the fit of the model
(χ2(1) = 3.60, p = .06). rLOGFREQ x TARGETMENTIONED did explain a significant
portion of variance (χ2(1) = 16.75, p < .001). Finally, none of the two-way
interactions of PRESENTATIONORDER and the other predictors in the model was
found to improve model fit (PRESENTATIONORDER x TARGETMENTIONED (χ2(1) = 0.57,
p = .45); PRESENTATIONORDER x CLOZEPROBABILITY (χ2(1) = 0.65, p = .42);
PRESENTATIONORDER x rLOGFREQ (χ2(1) = 0.21, p = .65); PRESENTATIONORDER x
WORDLENGTH (χ2(1) = 2.58, p = .11)). The model selection procedure thus resulted
in a model comprising WORDLENGTH, rLOGFREQ, PRESENTATIONORDER,
CLOZEPROBABILITY, TARGETMENTIONED, and rLOGFREQ x TARGETMENTIONED.
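The stepwise procedure just described amounts to forward selection with likelihood ratio tests; a schematic R sketch (predictor and data names are illustrative):

```r
library(lme4)

# Likelihood ratio tests on fixed effects require ML fits (REML = FALSE)
m1 <- lmer(VOT ~ WordLength + (1 | Participant) + (1 | Item),
           data = vot_data, REML = FALSE)
m2 <- update(m1, . ~ . + rLogFreq)
anova(m1, m2)  # keep rLogFreq if the chi-square test is significant
m3 <- update(m2, . ~ . + PresentationOrder)
anova(m2, m3)
# ... and so on for Block, GenericSurprisal, ClozeProbability,
# TargetMentioned, and the interaction terms, dropping each predictor
# that does not significantly improve model fit.
```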
We then added random slopes for participants. There are no by-item random
slopes, because each item has only one lemma frequency, one cloze probability,
one corpus-based surprisal estimate, one length, and a fixed position in the
presentation order. Furthermore, there are items no one had mentioned in the
completion task, thus prohibiting by-item random slopes for TARGETMENTIONED.
Within these limits, a model with a full random effect structure was constructed
following Barr et al. (2013). Subsequently, we excluded random slopes with the
lowest variance step by step until a further reduction would imply a significant
loss in the goodness of fit of the model (Matuschek et al. 2017). Model
comparisons indicated that the inclusion of the by-participant random slopes for
WORDLENGTH, PRESENTATIONORDER, CLOZEPROBABILITY, and TARGETMENTIONED was
justified by the data (χ2(5) = 53.00, p < .001). Then, confidence intervals were
estimated via parametric bootstrapping over 1000 iterations (Bates et al. 2015).
We first ran the model using Target not mentioned as the reference condition and
then Target mentioned. The outcomes are presented in Table 4.5 in Section 4.4.2.
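The random-slope reduction following Matuschek et al. (2017) can be sketched as follows, again with illustrative names: the full model carries the maximal by-participant slope structure, and the reduced model drops the slope with the lowest variance.

```r
library(lme4)

# Full model: maximal by-participant random-slope structure
m_full <- lmer(VOT ~ WordLength + PresentationOrder + ClozeProbability +
                 rLogFreq * TargetMentioned +
                 (1 + WordLength + PresentationOrder + ClozeProbability +
                    TargetMentioned + rLogFreq | Participant) + (1 | Item),
               data = vot_data)
summary(m_full)$varcor  # find the random slope with the lowest variance

# Reduced model: that slope (here rLogFreq) removed
m_red <- lmer(VOT ~ WordLength + PresentationOrder + ClozeProbability +
                rLogFreq * TargetMentioned +
                (1 + WordLength + PresentationOrder + ClozeProbability +
                   TargetMentioned | Participant) + (1 | Item),
              data = vot_data)
anova(m_red, m_full)  # a non-significant test justifies the reduction
```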
Appendix 5.1
Mean standardized familiarity ratings for the Job ad stimuli
Familiarity ratings

     Cue                                        Recruiters     Job-seekers    Inexperienced
                                                M (SD)         M (SD)         M (SD)
 1   40 uur per week                            0.99 (0.77)    0.97 (0.56)    1.13 (0.57)
 2   voor meer informatie                       0.70 (0.88)    0.79 (0.71)    0.94 (0.77)
 3   kennis en ervaring                         0.69 (0.79)    0.56 (0.85)    0.37 (0.75)
 4   hoog in het vaandel                        0.20 (0.69)    0.11 (0.79)    0.17 (0.86)
 5   werving en selectie                        1.19 (0.65)    0.90 (0.61)   -0.21 (0.85)
 6   een vast dienstverband                     0.99 (0.61)    0.52 (0.66)   -0.06 (0.62)
 7   voor langere tijd                          0.46 (0.75)    0.31 (0.67)    0.75 (0.54)
 8   het eerste aanspreekpunt                   0.41 (0.71)    0.23 (0.54)   -0.11 (0.70)
 9   goede contactuele eigenschappen            0.58 (0.88)    0.20 (1.06)   -0.52 (0.84)
10   bij gebleken geschiktheid                  0.28 (0.89)   -0.03 (0.73)   -0.60 (0.62)
11   academisch werk- en denkniveau             0.69 (0.58)    0.32 (0.65)   -0.32 (0.63)
12   een grote mate van zelfstandigheid         0.49 (0.52)    0.25 (0.68)    0.03 (0.74)
13   in een hecht team                          0.62 (0.68)    0.47 (0.64)    0.66 (0.60)
14   een persoonlijk ontwikkelingsplan          0.34 (0.78)    0.13 (1.44)   -0.45 (0.72)
15   een sterk analytisch vermogen              0.88 (0.51)    0.75 (0.76)    0.17 (0.81)
16   met de mogelijkheid tot verlenging         0.53 (0.83)    0.26 (0.79)   -0.13 (0.61)
17   in de breedste zin van het woord           0.14 (0.82)    0.62 (1.00)    0.55 (0.77)
18   met een afstand tot de arbeidsmarkt        0.07 (0.60)   -0.82 (0.82)   -1.05 (0.54)
19   het geschetste profiel                     0.33 (0.67)    0.12 (0.82)   -0.12 (0.65)
20   in de meest uiteenlopende sectoren        -0.38 (0.67)   -0.50 (0.67)   -0.84 (0.49)
21   een vliegende start                        0.05 (0.91)    0.35 (1.06)    0.64 (0.83)
22   bewijs van goed gedrag                     0.36 (0.71)    0.29 (0.75)   -0.03 (0.72)
23   conform de geldende CAO                    0.33 (0.85)    0.03 (0.91)   -0.97 (0.68)
24   met behoud van uitkering                   0.02 (0.78)   -0.31 (0.80)   -0.94 (0.48)
25   bevoegd en bekwaam                        -0.05 (0.81)   -0.23 (0.76)   -0.29 (0.73)
26   een integrale benadering                  -0.70 (0.71)   -0.62 (1.00)   -1.33 (0.50)
27   naar aanleiding van de advertentie         0.28 (0.93)    0.28 (0.79)    0.09 (0.55)
28   eenvoudige administratieve werkzaamheden   0.17 (0.74)   -0.01 (0.78)   -0.13 (0.62)
29   een scherpe blik                           0.03 (0.65)    0.56 (0.89)    0.62 (0.72)
30   buiten de geijkte paden                   -0.34 (0.92)   -0.60 (1.07)   -1.14 (0.75)
31   affiniteit met het onderwerp               0.02 (0.73)    0.35 (0.67)   -0.29 (1.06)
32   een internationale speler van formaat     -0.83 (0.80)   -0.76 (0.89)   -0.44 (1.05)
33   een flinke portie lef                     -0.49 (0.82)   -0.14 (0.72)   -0.26 (0.67)
34   met bewezen kwaliteiten                   -0.08 (1.00)   -0.23 (0.94)   -0.48 (0.61)
35   een collegiale opstelling                 -0.18 (0.78)   -0.30 (0.81)   -0.86 (0.68)
Appendix 5.2
Mean standardized familiarity ratings for the News report stimuli
Familiarity ratings

     Cue                                    Recruiters     Job-seekers    Inexperienced
                                            M (SD)         M (SD)         M (SD)
36   de Tweede Kamer                        0.59 (0.80)    0.74 (0.79)    1.23 (0.89)
37   wetenschap en techniek                -0.72 (1.03)   -0.50 (0.80)   -0.51 (0.65)
38   verkeer en vervoer                    -0.63 (1.00)   -0.52 (0.91)   -0.46 (0.86)
39   in elk geval                           0.67 (0.73)    0.83 (0.74)    1.44 (0.55)
40   in de Verenigde Staten                 0.10 (0.84)    0.67 (0.67)    1.38 (0.78)
41   het openbaar ministerie                0.03 (0.95)    0.42 (1.10)    0.80 (0.66)
42   de negentiende eeuw                   -0.15 (1.11)    0.18 (1.21)    1.04 (0.98)
43   de raad van bestuur                   -0.02 (0.89)    0.37 (0.86)    0.32 (0.79)
44   aan de andere kant                     0.53 (0.75)    0.67 (0.65)    1.07 (0.53)
45   evenementen en manifestaties          -1.17 (0.81)   -1.17 (0.86)   -1.00 (0.55)
46   het dagelijks leven                    0.34 (0.77)    0.59 (0.59)    1.12 (0.61)
47   op een gegeven moment                  0.35 (1.03)    0.66 (0.71)    1.37 (0.91)
48   met terugwerkende kracht               0.25 (0.79)    0.41 (0.74)    0.33 (0.68)
49   in volle gang                          0.03 (0.84)    0.38 (0.80)    0.77 (0.74)
50   een doorn in het oog                   0.01 (0.94)    0.25 (1.03)    0.05 (0.96)
51   op geen enkele wijze                  -0.26 (0.80)   -0.09 (0.71)    0.18 (0.67)
52   aan het begin van het seizoen         -0.17 (0.78)   -0.30 (0.72)    0.28 (0.60)
53   de lokale bevolking                    0.06 (0.88)    0.01 (0.69)    0.46 (0.63)
54   het centrum van de stad                0.22 (0.86)    0.16 (0.90)    0.72 (0.70)
55   correcties en aanvullingen            -0.39 (0.73)   -0.26 (1.40)   -0.07 (0.64)
56   de opvang van asielzoekers            -0.12 (0.85)   -0.06 (0.69)    0.02 (0.68)
57   de traditionele partijen              -0.62 (1.63)   -0.63 (0.67)   -0.53 (0.74)
58   op last van de rechter                -0.96 (0.82)   -0.72 (0.71)   -1.00 (0.70)
59   in de huidige situatie                 0.34 (0.75)    0.32 (0.48)    0.67 (0.57)
60   een onafhankelijke commissie          -0.53 (0.74)   -0.39 (0.65)   -0.76 (0.55)
61   een criminele afrekening              -0.93 (0.83)   -0.56 (0.84)   -0.42 (0.75)
62   de koninklijke loge                   -1.39 (0.80)   -1.72 (0.87)   -1.31 (0.67)
63   een ingrijpende herstructurering      -0.78 (1.04)   -0.65 (0.69)   -0.72 (0.49)
64   op weg naar de top                    -0.04 (0.82)   -0.14 (0.79)    0.29 (0.58)
65   in het belang van het kind            -0.43 (1.01)   -0.17 (0.80)    0.35 (0.77)
66   aan de vooravond van een revolutie    -1.21 (0.67)   -1.08 (0.87)   -1.38 (0.50)
67   de uitkomsten van het rapport         -0.10 (0.63)   -0.27 (0.72)    0.00 (0.62)
68   met hernieuwde energie                -0.52 (1.18)   -0.84 (1.35)   -0.36 (0.73)
69   een ongekende vrijheid                -0.44 (1.02)   -0.67 (0.87)   -0.16 (0.62)
70   een luxe jacht                        -0.74 (0.73)   -0.72 (0.88)    0.24 (0.66)
Appendix 5.3
Linear mixed-effects models fitted to standardized familiarity
ratings (Magnitude Estimation task)
We fitted linear mixed-effects models (Baayen et al. 2008), using the LMER
function from the lme4 package in R (version 3.3.3; CRAN project; R Core Team,
2017), to the standardized familiarity ratings. We investigated to what extent
these ratings can be predicted by corpus-based phrase frequency (LOGFREQPHRASE); lemma frequency of the final word in the phrase (LOGFREQLEMMA); whether or not a participant expected the final word to occur given the preceding words (TARGETMENTIONED); and the time it took the participant to start pronouncing the target word when presented following the cue (VOT). In
addition, we examined whether there are effects of PRESENTATIONORDER and BLOCK.
The fixed effects were standardized. We incorporated a random intercept for
items to account for between-item variation. A by-participant random intercept
was not included, because after the Z-score transformation all participants’
scores have a mean of 0. We then added fixed effects one by one and assessed
by means of likelihood ratio tests whether or not they significantly contributed to
explaining variance in familiarity ratings.
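The reason no by-participant intercept is needed can be made concrete: after the per-participant Z-score transformation, every participant's mean rating is 0 by construction, so such an intercept would have no variance left to capture. A sketch of the standardization (column names are illustrative):

```r
# Z-transform familiarity ratings within each participant:
# per participant, mean 0 and SD 1 by construction.
ratings$zRating <- ave(ratings$Rating, ratings$Participant,
                       FUN = function(x) (x - mean(x)) / sd(x))
```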
We started with LOGFREQPHRASE, which significantly contributed to the fit of the
model (χ2(1) = 35.14, p < .001). We then added LOGFREQLEMMA (χ2(1) = 2.50, p =
.11). Given that it did not improve model fit, we left out this predictor. We
proceeded with TARGETMENTIONED (χ2(1) = 283.00, p < .001), followed by VOT
(χ2(1) = 8.90, p < .01), each of which was found to improve model fit.
PRESENTATIONORDER did not contribute to the fit of the model (χ2(1) = 3.77, p = .06);
BLOCK did (χ2(1) = 6.30, p < .05). We then included the interaction terms
LOGFREQPHRASE x TARGETMENTIONED (χ2(1) = 16.37, p < .001), and VOT x
TARGETMENTIONED (χ2(1) = 7.78, p < .01). Finally, none of the two-way interactions
of BLOCK and the other predictors in the model was found to improve model fit
(BLOCK x LOGFREQPHRASE (χ2(1) = 2.56, p = .11); BLOCK x TARGETMENTIONED (χ2(1) =
0.22, p = .64); BLOCK x VOT (χ2(1) = 0.25, p = .62)).
The model selection procedure thus resulted in a model comprising
LOGFREQPHRASE, TARGETMENTIONED, VOT, BLOCK, LOGFREQPHRASE x TARGETMENTIONED,
and VOT x TARGETMENTIONED. For all of these fixed effects, we included a by-participant random slope. For the factor VOT we also added a by-item random
slope. There are no other by-item random slopes, because each item has only one
phrase frequency, and occurred in only one of the two blocks. Furthermore, there
are items no one had mentioned in the completion task, thus prohibiting by-item
random slopes for TARGETMENTIONED. Within these limits, a model with a full
random effect structure was constructed following Barr et al. (2013).
Subsequently, we excluded random slopes with the lowest variance step by step
until a further reduction would imply a significant loss in the goodness of fit of the
model (Matuschek et al. 2017). Model comparisons indicated that the inclusion
of the by-participant random slopes for LOGFREQPHRASE, TARGETMENTIONED, and
BLOCK was justified by the data (χ2(4) = 97.25, p = .001). The variance explained
by this model is 37% (R2m = .18, R2c = .37).36 Confidence intervals were estimated
via parametric bootstrapping over 10000 iterations (Bates et al. 2015). We first
ran the model using Target not mentioned as the reference condition and then
Target mentioned. The outcomes are presented in Tables 5.1 and 5.2 in Section
5.4.
36 R2m (marginal R² coefficient) represents the amount of variance explained by the fixed effects; R2c (conditional R² coefficient) is interpreted as variance explained by both fixed and random effects (i.e. the full model) (Johnson 2014).
Summary
This dissertation presents research into variation between and within participants
in their metalinguistic judgments about, and processing of, multi-word sequences.
Numerous studies provide evidence that language users are sensitive to the
likelihood of words to co-occur and that they make use of this information in
language acquisition and processing (for overviews see Diessel 2007; Gries &
Divjak 2012; Jurafsky et al. 2001; Kuperberg & Jaeger 2016). The more frequently
a string of words is used, the more quickly and easily the sequence is retrieved
and processed and the more familiar it is considered to be. This suggests that
usage frequency affects our mental representations of language: more experience
with a linguistic construction makes it more strongly entrenched in the speaker’s
mental lexicon, which in turn influences the probability that the construction will
be used, the speed with which it is processed, and the speaker’s metalinguistic
knowledge regarding its use.
If usage-based models of linguistic representations (Barlow & Kemmer 2000;
Bybee 2006; Goldberg 2006; Langacker 1987; Schmid 2007; Tomasello 2003) are
correct in positing such a strong link between usage frequency and entrenchment,
it follows that the extent to which a linguistic construction is entrenched varies
from person to person, as well as over time. That is, since language users differ
in their linguistic experiences, there are likely to be differences in entrenchment
across individuals. Furthermore, a language user gains new linguistic experiences
over time, and usage-based linguistics predicts mental representations of
language to change accordingly. There is a shortage of empirical data on these
types of variation, though. As I discuss in more detail in Chapter 1, the past five
decades have seen a wealth of studies yielding evidence in support of usage-based theories of language acquisition and processing, but these studies have
paid little attention to inter- and intra-individual variation in adult native speakers.
A central aim of the studies presented in this dissertation is to show that insight
into these types of variation is a prerequisite for a veridical description of mental
representations of language.
Chapters 2 and 3 present two studies that examine inter- and intra-individual
variation in metalinguistic judgments. The latter was investigated by means of a
test-retest design: participants performed the same task twice within the space
of one to three weeks. In both studies, native speakers of Dutch were asked to
assign familiarity ratings to a set of prepositional phrases that cover a wide range
of corpus frequencies (e.g. op de bank ‘on the couch / in the bank’, in de lucht ‘in
the air’). In the study reported on in Chapter 2, 44 phrases were presented in
isolation as well as in a sentential context, to investigate whether context affects
perceived degree of familiarity and inter- and intra-individual variation in
judgments. The participants assigned ratings using the method of Magnitude
Estimation (Bard et al. 1996). Aggregated scores (averaged over 86 participants)
are remarkably consistent (Pearson’s r = .97), and there is a significant
relationship between familiarity ratings and corpus frequencies of the phrases. At
the same time, there is considerable variation between and within participants.
Context does not reduce this variation. As random noise does not seem to
account for the patterns of variation in the data, I propose to consider the
possibility that intra-individual variation is a genuine property of one’s
metalinguistic representations and ultimately one’s linguistic representations.
This implies that the difference between people’s ratings at one point in time
cannot be interpreted straightforwardly as the difference in their linguistic
representations. A more complete and more faithful impression requires multiple
measurements.
Chapter 3 starts by describing how, in various fields of linguistics, variation has
been overlooked, looked at from a limited perspective (e.g. variation being simply
the result of irrelevant performance factors), or considered troublesome. I then
argue that it is both feasible and valuable to study different types of variation. To
illustrate this, I conducted an experiment in which 91 participants assigned
familiarity ratings to 79 prepositional phrases. They performed the task twice
within a couple of weeks, using either a 7-point Likert scale or a Magnitude
Estimation scale. The research design employed here thus yielded data on
variation across items, across participants, across time, and across rating
methods. I explicate the principles according to which the different types of
variation can be considered information about mental representation, and I show
how they can be used to test hypotheses regarding linguistic representations.
The results indicate that familiarity judgments form methodologically reliable,
useful data in linguistic research. The ratings obtained with one scale were
corroborated by the ratings on the other scale. In addition, there was a near
perfect Time1–Time2 correlation of the mean ratings in all experimental
conditions, and in all conditions the majority of the participants had high self-correlation scores. Furthermore, the data show a clear correlation between
familiarity ratings and corpus frequencies.
Similar to the dataset analyzed in Chapter 2, the familiarity ratings display inter- and intra-individual variation. Usage-based exemplar models (Goldinger 1996;
Hintzman 1986; Pierrehumbert 2001) naturally accommodate such variation. In
these models, linguistic representations consist of a continually updating set of
exemplars. An exemplar is not a tape recording stored in memory, but a
multidimensional, detail-rich representation that follows from a process of
analysis and categorization (Taylor 2012). While the judgment task requires
people to indicate the position of a given item on a scale of familiarity by means
of a single value, its familiarity for a particular speaker may best be viewed as a
moving target located in a region that may be narrower or wider. In that case,
there is not just one true value, but a range of scores that constitute true
expressions of an item’s familiarity. Variation in judgment across time is not noise
then, but a reflection of the dynamic character of cognitive representations as
more, or less, densely populated clouds of exemplars that vary in strength
depending on frequency and recency of use.
Chapter 3 concludes with a discussion of the similarities and differences
between Magnitude Estimation (ME) and Likert scale ratings. In several respects,
the two scales yielded similar outcomes, but there are also differences that ought
to be taken into account when selecting a particular scale. Likert ratings, unlike
ME ratings, make it possible to determine whether participants consider the
majority of items to be familiar (or unfamiliar), and whether they consider the
entire set of stimuli more familiar the second time (as a result of the exposure in
the test sessions, for example). A disadvantage of using a Likert scale is the risk
that the number of response options does not match the degrees of familiarity as
perceived by the participants, which could result in a loss of information. ME
allows participants to distinguish as many degrees as they feel relevant. When
using ME, the vast majority of the participants in the study reported here (83.3%)
distinguished more than seven degrees, indicating that a 7-point Likert scale may
not be optimal for the construct and the set of stimuli used here.
In Chapters 4 and 5, I examine inter- and intra-individual variation by means of three
experiments that I conducted with three groups of participants: 40 recruiters, 40
job-seekers, and 42 people not (yet) looking for a job (henceforth referred to as
Inexperienced). These groups can be expected to differ in experience with word
sequences that typically occur in job ads (e.g. goede contactuele eigenschappen
‘good communication skills’); they are not expected to differ systematically in
experience with word sequences characteristic of news reports (e.g. de Tweede
Kamer ‘the House of Representatives’). The word sequences were used as stimuli
in a completion task, a voice onset time (VOT) experiment, and a familiarity
judgment task. I thus examined the relationship between amount of experience
with a particular register and (i) the expectations people generate about
upcoming words when faced with word strings characteristic of that register; (ii)
the speed with which they process such word strings; and (iii) how familiar they
consider these word strings to be. Furthermore, I investigated the relationships
between data elicited from an individual participant in different types of
psycholinguistic tasks using the same stimuli. More specifically, I compared
participant-based measures, on the one hand, and measures based on
amalgamated data of different people, on the other, as predictors of performance
in psycholinguistic tasks. This provides insight into individual variation and the
merits of going beyond amalgamated data.
Chapter 4 reports on the completion task and the VOT task. In the completion
task, the participants were shown incomplete phrases (e.g. goede contactuele …
‘good communication …’) and for each stimulus they listed all complements that
came to mind within five seconds. This task yielded information on the
expectations people generate about upcoming words. Their responses were
compared with the complements observed in a job ad corpus and the Twente
News Corpus. The analyses revealed that on the News Report items, the groups
did not differ significantly from each other in the proportion of responses that
correspond to a complement observed in the Twente News Corpus. On the Job
ad stimuli, by contrast, the groups did differ significantly, as hypothesized. The
Recruiters’ responses corresponded significantly more often to complements
observed in the Job ad corpus than the Job-seekers’ responses. The Job-seekers’
responses, in turn, corresponded significantly more often to a complement in the
Job ad corpus than the responses of the Inexperienced participants. The results
indicate that there are differences in participants’ knowledge of multi-word units
which are related to their degree of experience with these word sequences.
In the subsequent VOT experiment, the participants were presented with the
same cues (e.g. goede contactuele … ‘good communication …’), followed by a
specific target word (e.g. eigenschappen ‘skills’), which they had to read aloud as
quickly as possible. The voice onset times indicate how much time it takes to
process the target word in the given context. According to prediction-based
processing models (Bar 2007; A. Clark 2013; Huettig 2015; Kuperberg & Jaeger
2016; Kutas et al. 2011), the target will be easier to recognize and process when
it consists of a word that the participant expected than when it consists of an
unexpected word. Most studies to date quantify a word's predictability by means of cloze probabilities and surprisal estimates, which are based on data amalgamated across speakers and thus disregard inter-speaker variation. Having
participants perform both a completion task and a VOT task made it possible to
relate reaction times to participants’ own expectations.
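The difference between amalgamated and participant-based predictability measures can be sketched as follows (hypothetical completion data; `cloze_probability` and the other names are illustrative, not the study's code):

```python
import math

def cloze_probability(completions, target):
    """Amalgamated measure: share of all participants who produced `target`
    as a completion for the cue."""
    hits = sum(1 for responses in completions.values() if target in responses)
    return hits / len(completions)

def surprisal_bits(p):
    """Surprisal in bits, -log2(p), for any probability estimate;
    infinite for a completion no one produced."""
    return float("inf") if p == 0 else -math.log2(p)

def target_mentioned(completions, target, participant):
    """Participant-based measure: did THIS participant expect the target?"""
    return target in completions[participant]
```

Two participants who face the same cloze probability for an item can still differ on TARGETMENTIONED; that individual information is exactly what the amalgamated measures discard.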
Firstly, the analyses revealed that the majority of the Recruiters and the Job-seekers responded faster to the Job ad items than to the News report items, while
it was exactly the other way around for the vast majority of the Inexperienced
participants. I then examined to what extent variation in VOTs across items and
across participants could be explained by different measures of word
predictability, while accounting for characteristics of the target words (i.e. word
length and word frequency) and the experimental design (i.e. presentation order
and block). Whether or not participants had mentioned the target significantly
affected voice onset times. What is more, this predictive pre-activation, as
captured by the variable TARGETMENTIONED, was found to facilitate processing to
such an extent that word frequency could not exert any additional accelerating
influence. This demonstrates the impact of context-sensitive prediction on
subsequent processing. Perhaps even more interesting is that the variable
TARGETMENTIONED had an effect on voice onset times over and above the effect of
CLOZEPROBABILITY. This shows the added value of going beyond amalgamated
data. While this may not come across as surprising, it is seldom demonstrated or exploited in research on prediction-based processing.
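The added value of a participant-based predictor over an amalgamated one can be illustrated with a toy simulation (assumed coefficients and synthetic data; this is not the dissertation's analysis):

```python
import random

random.seed(1)

# Simulate items: an amalgamated cloze score per item, a per-person binary
# expectation (more likely when cloze is high), and a reaction time that
# depends on both. All coefficients are assumptions for illustration.
items = []
for _ in range(500):
    cloze = random.random()
    mentioned = 1.0 if random.random() < cloze else 0.0
    rt = 600 - 80 * cloze - 60 * mentioned + random.gauss(0, 20)
    items.append((cloze, mentioned, rt))

def sse(predict):
    """Sum of squared prediction errors over the simulated items."""
    return sum((rt - predict(c, m)) ** 2 for c, m, rt in items)

# Best cloze-only predictor: since E[mentioned | cloze] = cloze in this
# simulation, the expected RT given cloze alone is 600 - 140 * cloze.
err_amalgamated = sse(lambda c, m: 600 - 140 * c)
# A predictor that also uses the participant-based variable.
err_participant_based = sse(lambda c, m: 600 - 80 * c - 60 * m)

assert err_participant_based < err_amalgamated
```

Even a cloze-only predictor that is optimal given aggregated information leaves error that knowing this individual's own expectation removes, mirroring the effect of TARGETMENTIONED over and above CLOZEPROBABILITY.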
After having completed the VOT task, the participants assigned familiarity ratings
to the word sequences using Magnitude Estimation. In Chapter 5, I analyze the
judgment data in relation to the data from the completion task and the VOT task
as well as corpus frequencies. In this way, I examine whether the degree to which
linguistic constructions are entrenched in the participants’ minds manifests itself
not just in processing but also in metalinguistic judgments. In other words, are
these degrees of entrenchment part of one’s explicit knowledge and can
metalinguistic judgments be used to gain insight into entrenchment? On the one
hand, “judgments are the results of linguistic and cognitive processes, by which
people attempt to process sentences and then make metalinguistic judgments on
the results of those acts of processing (…) Thus, they implicate the same linguistic
representations involved in all acts of processing”, as Branigan and Pickering
(2017: 4) contend. On the other hand, judgments may be influenced by knowledge
and beliefs (Dąbrowska 2016a) and reflect decision-making biases (Branigan &
Pickering 2017) which are not involved in language processing. Various
researchers are concerned that introspections cannot yield accurate insights into
subconscious cognitive processes (e.g. Gibbs 2006; Roehr 2008; Stubbs 1993).
Prior research has examined the relationship between familiarity ratings and
various kinds of psycholinguistic data. A limitation of those studies is that the sets
of familiarity ratings come from different people than the datasets indicating
performance in processing tasks. Consequently, we cannot tell whether a
discrepancy between familiarity judgments and processing data reflects the fact
that different tasks tap into different processes and knowledge, or whether it
reflects individual variation in linguistic representations. By having participants
perform both a judgment task and processing tasks, I was able to differentiate
between the two.
Firstly, the results show that differences in experiences with a particular
register were reflected in the familiarity ratings that participants assigned to
phrases characteristic of that register. The vast majority of the Recruiters
considered the Job ad phrases to be more familiar than the News report phrases,
while for the Inexperienced participants it was the other way around. Secondly,
individual participants’ data from the completion task and the VOT task are
significant predictors of the familiarity ratings they assigned to the stimuli. This
indicates that familiarity judgments and other types of psycholinguistic data tap
into the same mental representations of language, and that familiarity ratings
form useful data to gain insight into these representations.
The dissertation concludes with two chapters in which I reflect on the studies I
conducted. In Chapter 6, I focus on the methodological lessons that can be
learned from them. The chapter highlights the merits of multi-method research in
linguistics and offers an overview of key considerations in the design of such
research. It discusses methodological and practical concerns in the selection of
corpus data, metrics to analyze corpus data, stimuli, experimental tasks, and
participants.
In Chapter 7, I consider the theoretical implications of my findings. The results
indicate that there are systematic differences in participants’ knowledge and
processing of multi-word units which are related to their degree of experience with
these word sequences. This provides empirical support for hypotheses that follow from usage-based theories of linguistic knowledge and language processing.
Furthermore, an individual’s performance in one experiment was shown to be a
significant predictor of performance in another experiment, on top of measures
based on amalgamated data of different people (i.e. corpus-based frequencies,
surprisal, cloze probabilities). In other words, participant-based measures proved
to have unique additional explanatory power. This demonstrates the existence of
systematic, measurable inter-individual variation in behavioral indices of cognitive
routinization. Variation is ubiquitous, but, crucially, not random. One of the
important tasks that we face when we want to arrive at accurate theories of
linguistic representation and processing is to define the factors that determine
the degrees of variation between individuals, and this requires going beyond
amalgamated data.
In addition to inter-individual variation, there is evidence of intra-individual
variation which, too, points to the dynamic character of mental representations of
language. Most psycholinguistic tasks that try to tap into the degree of entrenchment of a linguistic unit in the mind of a speaker express this in a single value (e.g. a rating, a reaction time). However, if cognitive representations can
best be viewed as more, or less, densely populated clouds of exemplars that vary
in strength depending on frequency and recency of use, a single score yields an
incomplete picture. Therefore, I not only advocate attending to variation across participants, but also urge cognitive linguists to carry out multiple measurements per participant.
To conclude, I sketch three compelling directions for future research that build
on the work presented in this dissertation. I propose, first of all, to further develop
participant-based measures. In my studies, I converted the completion task
responses into a variable that indicates for each participant individually whether
the target word had been mentioned or not (TARGETMENTIONED). It proved to be a
valuable measure. However, as a binary variable, it does not account for gradient
differences in the degree to which words are expected to occur. I provide
suggestions as to how the potential of participant-based data can be explored.
Secondly, I propose to follow participants over the course of a few weeks or
months, extending the test-retest design. This can provide additional insights into
the effects of usage frequency on processing speed and perceived degree of
familiarity. It is clear by now that frequency is a key factor. What is not so clear is to what extent recency of use matters; whether it makes a difference whether
you used a linguistic item once or twice that day; and whether this works
differently for low-frequency items compared to high-frequency ones.
Thirdly, I propose to examine (partially) schematic constructions in addition to
lexically specific ones. On a usage-based account, mental representations of
(partially) schematic constructions are dynamic in nature too, just like
representations of lexically specific constructions such as morphemes, complex
words, and multi-word units. All representations are taken to emerge from, and
are continuously shaped by, experience with language together with general
cognitive skills such as categorization, schematization, and chunking. However,
schematic constructions tend to have a more general meaning, a wider range of
usage contexts, and a higher frequency of occurrence than lexically specific
constructions, which may result in less inter- and intra-individual variability. What
should also be taken into account is that speakers may differ in cognitive abilities,
such as language analytic ability, statistical learning ability, fluid intelligence, and
cognitive motivation (Dąbrowska 2018; Misyak & Christiansen 2012). Both
linguistic experiences and cognitive abilities appear to influence the process of
schematization and speakers’ knowledge of grammatical constructions. There
are indications that this does not hold for collocational knowledge in the same
way (Dąbrowska 2018). While representations of words, multi-word units, and
grammatical patterns can still be construed as constructions that emerge from
linguistic experience together with general cognitive skills, they may differ in the
extent to which they rely on various cognitive and experiential factors. Research
that aims to advance our understanding of the contributions of these factors must
pay attention to individual differences. I hope this dissertation contributes to this
research agenda by demonstrating that it is feasible and valuable to attend to
inter- and intra-individual variation and by sparking linguists’ enthusiasm for such
an approach.
Samenvatting
Stel, aan een groep mensen wordt de zin Bij gelijke geschiktheid gaat onze
voorkeur uit naar een vrouwelijke kandidaat voorgelegd. In hoeverre verschillen zij
van elkaar in de manier waarop ze deze zin verwerken, en kunnen we deze
verschillen verklaren? Lange tijd beschouwden taalkundigen woorden en
grammaticale regels als de bouwstenen in taal. In de afgelopen vijftig jaar is
echter duidelijk geworden dat dat niet volstaat als beschrijving van de mentale
organisatie van taal. We beschikken over een veel gevarieerdere set aan talige
eenheden. Een zin als Bij gelijke geschiktheid gaat onze voorkeur uit naar een
vrouwelijke kandidaat kan geproduceerd en begrepen worden door de losse
woorden en de syntactische structuur waarin ze zijn ingebed te activeren, maar
taalgebruikers kunnen ook grotere eenheden gebruiken. Ze kunnen bijvoorbeeld
gebruik maken van woordcombinaties (multi-word units zoals bij gelijke
geschiktheid) en gedeeltelijk schematische eenheden (zoals gaat
LIDWOORD/BEZITTELIJK VNW voorkeur uit naar NAAMWOORDGROEP). Psycholinguïstisch
onderzoek heeft aangetoond dat sommige van dergelijke constructies sneller
worden verwerkt, gemakkelijker worden herinnerd, en vertrouwder aandoen dan
andere. Dit suggereert dat ze verschillen in de mate waarin ze verankerd zijn in
onze taalkennis – met andere woorden, de mate van entrenchment varieert.
Gebruiksfrequentie lijkt een sleutelrol te spelen in het proces van entrenchment:
hoe vaker een talige constructie gebruikt wordt, hoe sterker deze verankerd wordt
in het mentale lexicon van de taalgebruiker, waardoor het makkelijker wordt om
de constructie te activeren en te verwerken.
Gebruiksgebaseerde modellen van mentale representaties van taal (Barlow &
Kemmer 2000; Bybee 2006; Goldberg 2006; Langacker 1987; Schmid 2007;
Tomasello 2003) stellen dat er een sterk verband is tussen gebruiksfrequentie en
entrenchment. Als dit werkelijk zo is, dan varieert de mate waarin een constructie verankerd is zowel van persoon tot persoon als in de loop der tijd.
Variatie in entrenchment tussen mensen komt voort uit het feit dat taalgebruikers
van elkaar verschillen in de frequentie waarmee ze bepaalde constructies
tegenkomen en gebruiken. Variatie door de tijd heen volgt uit het feit dat
taalgebruikers nieuwe ervaringen met taal opdoen gedurende hun leven. Volgens
gebruiksgebaseerde modellen veranderen mentale representaties van taal mee:
toenemend gebruik leidt tot sterkere verankering; de representatie verzwakt als
een constructie een tijd lang niet gebruikt wordt (Langacker 1987: 59). Empirische
data over deze vormen van variatie zijn echter schaars. In Hoofdstuk 1 beschrijf
ik dat er in de laatste vijf decennia veel onderzoek heeft plaatsgevonden waarvan
de uitkomsten in lijn zijn met gebruiksgebaseerde theorieën over taalverwerving
en -verwerking. Het merendeel van deze studies heeft echter weinig aandacht
besteed aan variatie tussen en binnen volwassen moedertaalsprekers. Het doel
van de studies in dit proefschrift is aan te tonen dat inzicht in deze typen variatie
noodzakelijk is om te komen tot een waarheidsgetrouwe beschrijving van mentale
representaties van taal. Ik doe dit door de variatie tussen en binnen participanten
in metalinguïstische oordelen over, en verwerking van meerwoordsconstructies te
onderzoeken.
Hoofdstukken 2 en 3 rapporteren over twee studies naar inter- en intra-individuele
variatie in metalinguïstische oordelen (oordelen waarbij je reflecteert op taal,
taalgebruik, en taalkennis). Door de oordelentaak bij verschillende mensen af te
nemen is informatie verkregen over interindividuele variatie. Intra-individuele
variatie is onderzocht door deelnemers dezelfde taak twee keer te laten uitvoeren
in een periode van één tot drie weken. In beide studies hebben moedertaalsprekers
van het Nederlands vertrouwdheidsoordelen toegekend aan voorzetselgroepen
(bijv. op de bank, in de lucht). Deze woordcombinaties varieerden in de frequentie
waarmee ze voorkomen in een groot corpus van hedendaags Nederlands
taalgebruik. In de studie die beschreven wordt in Hoofdstuk 2 zijn 44
voorzetselgroepen gepresenteerd als losse woordcombinaties en tevens ingebed
in een zin, om na te gaan of context van invloed is op het gevoel van vertrouwdheid
en op de variatie in oordelen. De participanten kenden scores toe aan de hand van
een methode die Magnitude Estimation heet (Bard et al. 1996). De geaggregeerde
waardes, waarbij het gemiddelde werd genomen van de scores van 86
participanten, bleken opmerkelijk consistent (Pearson’s r = .97), en er was een
significant verband tussen de vertrouwdheidsscores en corpusfrequenties
(hogere frequenties gaan gepaard met hogere scores). Tegelijkertijd was er
sprake van aanzienlijke variatie tussen en binnen participanten in oordelen. Het
toevoegen van een zinscontext verminderde deze variatie niet. Er zijn taalkundigen
(bijv. Featherston 2007) die van mening zijn dat inter- en intra-individuele variatie
in metalinguïstische oordelen ruis is, die eruit gefilterd kan worden door met
geaggregeerde scores te werken. De variatie in mijn dataset vertoonde echter
patronen die niet verklaard lijken te kunnen worden in termen van willekeurige ruis.
Daarom stel ik voor om de mogelijkheid te overwegen dat intra-individuele variatie
een echt kenmerk is van metalinguïstische representaties en zelfs van alle soorten
talige representaties. Variatie in oordelen van moment tot moment zou een
reflectie kunnen zijn van de dynamiek van talige representaties. Dit impliceert dat
het verschil tussen de oordelen van twee mensen op één bepaald moment niet
zomaar beschouwd kan worden als hét verschil tussen hun mentale
representaties van taal. Op een ander moment kan het plaatje er namelijk anders
uitzien. Voor een vollediger en waarheidsgetrouwer beeld zijn meerdere metingen
nodig.
In Hoofdstuk 3 beschrijf ik hoe, in verscheidene gebieden binnen de taalkunde,
variatie over het hoofd werd gezien, beschouwd werd als simpelweg het gevolg
van irrelevante factoren (zoals beperkingen van het werkgeheugen en
vergissingen), of als lastig werd ervaren. Vervolgens bepleit ik dat het mogelijk en
waardevol is om verschillende typen variatie te bestuderen. Dit illustreer ik aan de
hand van een experiment waarbij 91 deelnemers 79 voorzetselgroepen
beoordeelden op vertrouwdheid. Ze voerden deze taak tweemaal uit, waarbij ze
gebruik maakten van ofwel een 7-puntslikertschaal, ofwel een Magnitude
Estimation schaal. Zo werden gegevens verkregen over variatie tussen items (de
voorzetselgroepen in dit geval), tussen participanten, tussen meetmomenten, en
tussen meetmethodes (Likert vs. Magnitude Estimation). Ik zet uiteen hoe de
verschillende typen variatie informatie kunnen verschaffen over mentale
representaties van taal, en ik toon hoe ze gebruikt kunnen worden om hypotheses
over representaties te toetsen.
De uitkomsten van deze studie geven aan dat vertrouwdheidsoordelen
methodologisch betrouwbare, bruikbare data zijn in taalkundig onderzoek. De
scores die met de ene schaal verkregen waren, werden bevestigd door de scores
op de andere schaal. Er was bovendien in alle experimentele condities een vrijwel
perfecte correlatie tussen de gemiddelde scores op moment 1 en moment 2.
Daarnaast had, in iedere conditie, de meerderheid van de participanten hoge zelfcorrelatiewaarden (m.a.w. iemands oordelen op moment 2 correleerden sterk
met diens eigen oordelen op moment 1). Ook was er sprake van een duidelijk
verband tussen vertrouwdheidsoordelen en corpusfrequenties.
De oordelen vertoonden, net als de dataset in Hoofdstuk 2, inter- en intra-individuele variatie. Gebruiksgebaseerde exemplar modellen (Goldinger 1996;
Hintzman 1986; Pierrehumbert 2001) bieden van nature ruimte voor dergelijke
variatie. In deze modellen bestaan mentale representaties van taal uit een set
exemplars die continu geüpdatet wordt. Een exemplar is niet een kleine
bandopname die opgeslagen wordt in je geheugen, maar een multidimensionale,
detailrijke representatie die volgt uit een proces van analyse en categorisatie
(Taylor 2012). In de oordelentaak moeten deelnemers de positie van een item op
een schaal van vertrouwdheid uitdrukken in één getal, terwijl de vertrouwdheid
misschien eerder een bewegend doel is in een ruimte die meer of minder breed
kan zijn. In dat geval is er niet slechts één ware score, maar een reeks waarden
die de vertrouwdheid van een item uitdrukken. Variatie in scores van moment tot
moment hoeft geen ruis te zijn; het kan de weerslag zijn van het dynamische
karakter van cognitieve representaties als meer, of minder, compacte clusters van
exemplars die variëren in sterkte afhankelijk van hoe frequent en hoe recent
bepaalde constructies zijn gebruikt.
Hoofdstuk 3 besluit met een bespreking van de overeenkomsten en verschillen
tussen oordelen die met behulp van Magnitude Estimation (ME) uitgedrukt
worden en Likertschaaloordelen. In verscheidene opzichten leverden de twee
schalen vergelijkbare uitkomsten op, maar er zijn ook verschillen waar rekening mee moet worden gehouden bij het kiezen van een schaal. Zo kan alleen met de
Likertschaaloordelen bepaald worden of respondenten de meerderheid van de
items als vertrouwd (of niet vertrouwd) beschouwen, en of zij de gehele set items
de tweede keer vertrouwder achten (bijv. door het bezig zijn met de items tijdens
de experimenten). Een nadeel van het gebruiken van een Likertschaal is het risico
dat het aantal responsopties niet overeenkomt met de vertrouwdheidsgradaties
die de participanten reëel achten, waardoor er informatie verloren kan gaan. ME
staat participanten toe om precies het aantal gradaties te onderscheiden dat zij
relevant vinden. In het onderzoek dat beschreven wordt in Hoofdstuk 3,
onderscheidde het overgrote deel (83.3%) van de deelnemers meer dan zeven
gradaties bij het gebruik van ME, wat erop wijst dat een 7-puntslikertschaal
wellicht niet optimaal is voor het construct (vertrouwdheidsoordelen) en de items
(de 79 voorzetselgroepen) die hier gebruikt werden.
In Hoofdstukken 4 en 5 onderzoek ik inter- en intra-individuele variatie door middel
van drie experimenten die ik heb afgenomen bij drie groepen deelnemers: 40
recruiters en HR-managers, 40 werkzoekenden, en 42 studenten die zelden of
nooit vacatureteksten hadden gelezen (hierna de onervaren deelnemers
genoemd). Het is aannemelijk dat deze groepen verschillen in ervaring met
woordcombinaties die typisch zijn voor vacatureteksten (bijv. goede contactuele
eigenschappen, werving en selectie); er worden geen systematische verschillen
verwacht tussen de groepen in ervaring met woordcombinaties die kenmerkend
zijn voor nieuwsberichten (bijv. de Tweede Kamer, correcties en aanvullingen). De
woordcombinaties werden gebruikt als stimuli in een aanvultaak, een voice onset
time (VOT) experiment, en een vertrouwdheidsoordelentaak. Aldus onderzocht ik
of er een verband is tussen enerzijds de mate van ervaring met een bepaald
register en anderzijds (i) de verwachtingen die mensen genereren over woorden
die mogelijk volgen wanneer ze woordsequenties zien die kenmerkend zijn voor
dat register; (ii) de snelheid waarmee ze dergelijke woordcombinaties verwerken;
en (iii) hoe vertrouwd deze woordcombinaties voor hen zijn. Ook onderzocht ik
hoe verschillende soorten data van één participant, verkregen in verschillende
psycholinguïstische taken, zich tot elkaar verhouden. Ik heb maten die gebaseerd
zijn op data van een individuele participant vergeleken met maten die gebaseerd
zijn op data van verschillende mensen. Dit verschaft inzicht in individuele variatie
en de toegevoegde waarde van gepersonaliseerde maten ten opzichte van
geaggregeerde data.
Hoofdstuk 4 doet verslag van de aanvultaak en de VOT-taak. In de aanvultaak
kregen de deelnemers incomplete frases te zien (bijv. goede contactuele …). Bij
iedere stimulus somden ze de aanvullingen op die binnen vijf seconden in hen
opkwamen. Deze taak levert informatie op over de verwachtingen die iemand
genereert over woorden die kunnen volgen. De antwoorden werden vergeleken
met de aanvullingen die voorkomen in een corpus met vacatureteksten en het
Twente Nieuws Corpus. De analyses wezen uit dat er wat betreft de
nieuwsberichtstimuli geen significante verschillen waren tussen de groepen in de
proportie van antwoorden die corresponderen met een aanvulling in het Twente
Nieuws Corpus. Op de vacaturestimuli, daarentegen, waren er significante
verschillen tussen de groepen, zoals verwacht. De responses van de recruiters
kwamen significant vaker overeen met aanvullingen in het vacaturecorpus dan de
responses van de werkzoekenden. De responses van de werkzoekenden kwamen
op hun beurt weer significant vaker overeen met aanvullingen in het
vacaturecorpus dan de responses van de onervaren deelnemers. Deze
bevindingen tonen aan dat er verschillen zijn tussen de participanten in kennis van
meerwoordsconstructies, en dat die verschillen samenhangen met de mate waarin zij ervaring hebben met deze constructies.
In de daaropvolgende VOT-taak kregen de participanten dezelfde
woordsequenties te zien (bijv. goede contactuele …), dit keer gevolgd door een
specifiek woord (bijv. eigenschappen) dat ze zo snel mogelijk moesten voorlezen.
Ik berekende hoeveel milliseconden het duurde voor iemand het woord begon uit
te spreken. Deze voice onset time geeft aan hoeveel tijd het kost om het woord te
verwerken in de gegeven context. De hypothese is dat het woord gemakkelijker
herkend en verwerkt kan worden als het reeds verwacht werd gegeven de context
(prediction-based processing models, Bar 2007; A. Clark 2013; Huettig 2015;
Kuperberg & Jaeger 2016; Kutas et al. 2011). In eerder onderzoek is de
voorspelbaarheid van een woord gekwantificeerd door middel van cloze
probabilities (het percentage van de deelnemers dat dat woord invulde in de
gegeven context) en surprisal estimates (de mate waarin het woord afwijkt van de voorspellingen gegenereerd door taalmodellen die getraind zijn op corpusdata).
Deze maten zijn gebaseerd op data van een grote groep taalgebruikers en gaan
dus voorbij aan inter-individuele variatie. Doordat iedere deelnemer aan mijn
onderzoek zowel de aanvultaak als de VOT-taak maakte, kon ik het verband tussen
reactietijden en iemands eigen verwachtingen onderzoeken.
Uit de analyses bleek dat de meerderheid van de recruiters en de
werkzoekenden sneller reageerde op de vacature-items dan op de
nieuwsberichtitems, terwijl het omgekeerde het geval was voor het overgrote deel
van de onervaren participanten. Vervolgens heb ik onderzocht in hoeverre de
variatie in reactietijden tussen items en tussen participanten verklaard kan worden
door verschillende maten van de voorspelbaarheid van een woord, waarbij ik
rekening hield met kenmerken van de woorden (woordlengte en woordfrequentie)
en het onderzoeksontwerp (de volgorde waarin items gepresenteerd werden). De
reactietijden in de VOT-taak bleken significant korter te zijn als participanten het
woord genoemd hadden tijdens de aanvultaak – dit laatste werd uitgedrukt in de
variabele TARGETMENTIONED. De pre-activatie van woorden tijdens het genereren
van verwachtingen bleek de verwerking van de woorden in de VOT-taak zozeer te
vergemakkelijken dat woordfrequentie hier niets meer aan toevoegde. Doorgaans
worden hoogfrequente woorden sneller herkend en verwerkt dan laagfrequente
woorden, maar als het woord reeds genoemd was tijdens de aanvultaak had
woordfrequentie geen effect meer. Wellicht nog interessanter is dat de variabele
TARGETMENTIONED van invloed was op reactietijden bovenop het effect van
CLOZEPROBABILITY. Dit illustreert de toegevoegde waarde van maten die rekening
houden met variatie tussen participanten.
Na de VOT-taak kenden de deelnemers aan de hand van Magnitude Estimation
vertrouwdheidsscores toe aan de woordcombinaties. Hoofdstuk 5 beschrijft de
analyse van de vertrouwdheidsoordelen in relatie tot de data uit de aanvultaak en
de VOT-taak, en corpusfrequenties. Ik heb onderzocht of de mate waarin
woordcombinaties verankerd zijn in de mentale representaties van de
participanten niet alleen tot uitdrukking komt in de wijze waarop zij de
woordcombinaties verwerken, maar ook in hun metalinguïstische oordelen. Is de
mate van entrenchment onderdeel van iemands expliciete kennis en kunnen
metalinguïstische oordelen inzicht verschaffen in entrenchment? Aan de ene kant
zijn dergelijke oordelen het resultaat van cognitieve processen waarmee de
taalinput verwerkt wordt en waarmee er gereflecteerd wordt op de uitkomsten van
die verwerking. De oordelen doen daarmee een beroep op representaties van taal
die ook in andere vormen van verwerking een rol spelen (Branigan & Pickering
2017: 4). Aan de andere kant zouden oordelen beïnvloed kunnen worden door
kennis, overtuigingen, en biases die niet meespelen in taalverwerking (Dąbrowska
2016a; Branigan & Pickering 2017). Verscheidene onderzoekers zijn bezorgd dat
introspectie geen accuraat inzicht kan verschaffen in onderbewuste cognitieve
processen (o.a. Gibbs 2006; Roehr 2008; Stubbs 1993).
Er is al eerder onderzoek gedaan naar de relatie tussen
vertrouwdheidsoordelen en verscheidene soorten psycholinguïstische data. Een
beperking van die studies is dat de vertrouwdheidsoordelen van één groep
mensen komen en de taalverwerkingsdata van een andere groep. Een discrepantie
tussen oordelen en verwerkingsdata zou kunnen betekenen dat de taken een
beroep doen op verschillende processen en kennis; het zou echter ook het gevolg
kunnen zijn van individuele variatie in cognitieve representaties van taal.
Aangezien in mijn onderzoek de verschillende soorten data afkomstig zijn van
dezelfde participanten, kan ik een onderscheid maken tussen variatie tussen taken
enerzijds en variatie tussen participanten anderzijds.
De verschillen tussen de groepen deelnemers in ervaring met een bepaald
register bleken tot uitdrukking te komen in de vertrouwdheidsscores die ze
toekenden aan woordcombinaties die kenmerkend zijn voor dat register. De
overgrote meerderheid van de recruiters beschouwde de vacature-items namelijk
als vertrouwder dan de nieuwsbericht-items, terwijl het omgekeerde het geval was
voor de onervaren participanten. Uit de analyses bleek vervolgens dat iemands
eigen data uit de aanvultaak en de VOT-taak significante voorspellers waren voor
de vertrouwdheidsoordelen die diegene toekende. Dit wijst erop dat
vertrouwdheidsoordelen en andere soorten psycholinguïstische data een beroep
doen op dezelfde mentale representaties van taal en dat vertrouwdheidsoordelen
bruikbare data vormen om inzicht te verkrijgen in deze representaties.
In de laatste twee hoofdstukken reflecteer ik op de onderzoeken die ik heb
uitgevoerd. In Hoofdstuk 6 ligt de focus op de methodologische lessen die
getrokken kunnen worden uit mijn studies. Ik belicht de verdiensten van onderzoek
waarin verscheidene methodes gecombineerd worden en ik bied een overzicht
van de belangrijkste overwegingen in het ontwerp van dergelijk onderzoek. Aan
bod komen methodologische en praktische kwesties met betrekking tot het
selecteren van: corpusdata, metrieken om corpusdata te analyseren, stimuli,
experimentele taken, en participanten.
In Hoofdstuk 7 ga ik in op de theoretische implicaties van mijn bevindingen.
De resultaten geven aan dat er systematische verschillen zijn tussen mensen in
kennis en verwerking van woordcombinaties, en dat die verschillen in verband
staan met de mate van ervaring met deze woordcombinaties. Dit vormt
empirische ondersteuning voor hypotheses die volgen uit gebruiksgebaseerde
theorieën over taalkennis en -verwerking. Voorts bleken de data van een
participant afkomstig uit één type experiment een significante voorspeller te zijn
voor diens prestaties in volgende experimenten, bovenop maten die gebaseerd
zijn op data van een grote groep taalgebruikers (corpusfrequenties, surprisal
estimates, cloze probabilities). Met andere woorden, gepersonaliseerde maten
hebben unieke verklarende kracht. Dit toont aan dat er sprake is van
systematische, meetbare inter-individuele variatie in gedragsmatige indicaties van
cognitieve routinisering. Variatie is alomtegenwoordig, maar niet willekeurig. Als
we tot accurate theorieën over de cognitieve representatie van taal willen komen,
230
is het van belang dat we in kaart brengen welke factoren de variatie tussen
taalgebruikers bepalen, en dit vereist dat we ons niet beperken tot geaggregeerde
data, maar inzoomen op het niveau van individuen.
Besides inter-individual variation, my data also revealed intra-individual
variation. This, too, points to the dynamic character of mental representations of
language. Psycholinguistic tasks that aim to gauge the degree to which a linguistic
element is entrenched in someone's linguistic knowledge typically express this in
a single value (e.g., a reaction time). If cognitive representations take the form of
more, or less, compact clusters of exemplars that vary in strength, then a single
value yields an incomplete picture. For that reason, I argue not only for attention
to variation between people, but also for taking multiple measurements per
participant.
To conclude, I sketch three directions for future research that build on the
work presented in this dissertation. First, I propose to develop personalized
measures further. In my research, I converted the responses in the completion
task into a variable that indicates, for each participant individually, whether or not
that participant had produced the target word (TARGETMENTIONED). This
proved to be a valuable measure. Since it is a binary variable, however, it cannot
do justice to gradient differences in the predictability of words. I offer suggestions
for ways in which the potential of personalized measures can be explored further.
Second, I propose to follow participants over the course of several weeks or
months. This can provide more insight into the effects of frequency of use on
processing speed and familiarity judgments. It is clear that frequency has a strong
influence. It is less clear whether it matters how recently a construction has been
used; whether it matters whether you have used a construction once or twice that
day; and whether this plays out differently for low-frequency constructions than
for high-frequency ones.
Third, I propose to investigate not only lexically specific constructions (such
as words and word combinations), but (partially) schematic constructions as well.
According to usage-based theories, all mental representations of language
emerge from experiences with language, drawing on general cognitive abilities
such as pattern recognition, chunking, categorization, and schematization. If
(partially) schematic constructions are formed through experiences with
language, then inter- and intra-individual variation would be expected here as
well. That said, schematic constructions typically have a more general meaning,
are used in a larger number of contexts, and have a higher frequency of use than
lexically specific constructions. As a result, there may be less inter- and intra-
individual variation. It should also be taken into account that language users may
differ from one another in cognitive abilities, such as language-analytical ability,
statistical learning ability, fluid intelligence, and cognitive motivation (Dąbrowska
2018; Misyak & Christiansen 2012). Both experiences with language and
cognitive abilities appear to influence the process of schematization and
knowledge of grammatical constructions. There are indications that this does not
hold in the same way for knowledge of word combinations (Dąbrowska 2018).
Mental representations of words, word combinations, and more abstract patterns
can still be conceived of as constructions that arise from experiences with
language in combination with general cognitive abilities, but the degree to which
they draw on particular cognitive and experiential factors may vary. I hope that
this dissertation contributes to this research agenda by demonstrating that it is
not only possible, but also worthwhile, to take inter- and intra-individual variation
into account. It would please me greatly if my research succeeds in enthusing
linguists for such an approach.
Curriculum vitae
Curriculum vitae
Véronique Verhagen (Eindhoven, 12 December 1985) obtained her gymnasium
diploma from Lorentz Casimir Lyceum in 2003. She then completed the
bachelor’s program Linguistics and Intercultural Communication at Tilburg
University, as well as the extracurricular, interdisciplinary honors program. Upon
obtaining her bachelor’s degree in 2006, she was awarded an Excellence
Scholarship. Subsequently, she studied at Venice International University and took
additional courses at Tilburg University. After that, she completed the research
master’s program in Language and Communication (Tilburg University and
Radboud University Nijmegen), with a specialization in cognitive linguistics. For
her thesis on individual differences in entrenchment of multi-word units, she
received the Tilburg University research master’s thesis award. After her
graduation, she started as a PhD candidate at Tilburg University, supported by an
NWO Promoties in de Geesteswetenschappen grant. In 2015 and 2016, she was
appointed as a part-time lecturer at the department of Dutch Language and
Culture at Leiden University. In 2017 and 2018, she taught a variety of courses in
the department of Communication and Cognition at Tilburg University. Since
January 2019, she has been a lecturer in the Dutch teacher training program at Fontys
University of Applied Sciences.