
The ListTyp Database

2021

Francesca Masini (1), Simone Mattiola (1), Stefano Dei Rossi (2)
1. Alma Mater Studiorum – University of Bologna, Italy
2. WebSoup, Italy
francesca.masini@unibo.it, simone.mattiola@unibo.it, stefano@websoup.it

Abstract

The paper describes the aim and structure of a new freely accessible resource, ListTyp: A typological database of listing patterns, with a focus on methodological aspects, encoded information and search functions.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Listing Patterns and Typology

Typological investigation is challenging in its own right, let alone when it tackles ‘untraditional’ categories, namely (newly established) categories that are not part of the stock of customary, long-established concepts for linguistic description and are hence not usually described in grammars, either at all or as such. ‘Lists’ belong to this class.

Lists are traditionally associated with spoken language and interaction (see, among many others, Blanche-Benveniste (1990), Jefferson (1990), Selting (2007)). However, a broader approach has been proposed by Masini et al. (2018), who define ‘lists’ as syntagmatic concatenations of two or more units of the same type (potentially paradigmatically connected) that fill one and the same slot within the larger construction they are part of. This abstract definition embraces linguistic phenomena normally ascribed to different levels (morphology, syntax, discourse). ‘Lists’, or ‘listing patterns’ (LPs), thus encompass syntactic and discourse structures like coordination (e.g. The system allows gas, electricity and water meters to be read [British National Corpus]), reformulation (e.g. They now had lifts, or rather elevators [British National Corpus]) or repetition (e.g. Some people are very very very touchy [British National Corpus]), but also lexical and morphological phenomena like irreversible binomials (e.g. alive and kicking), (co-)compounding (e.g. Chuvash sĕt-śu lit. milk-butter ‘dairy products’, Wälchli (2005), p. 138) and full reduplication (e.g. Sundanese hayan-hayan lit. RED-want ‘want very much’, Moravcsik (1978), p. 321). Although these phenomena have their own specific properties (displaying different degrees of complexity, cohesion and conventionalization), lumping them together may unveil interesting (cross-linguistic) structural and functional tendencies and help bridge the gap between discourse and grammar.

Attempting a typological study of LPs is not trivial and raises methodological issues. Data are available for some widely described LPs (e.g. coordination, reduplication, co-compounding), but other types of LPs are far from simple to find in descriptive grammars, which usually (and understandably) focus on long-established categories in phonetics, morphology and syntax (often leaving aside, e.g., syntax beyond the clause and discourse phenomena). The same applies to typological databases. Hence, doing typology in the ‘traditional’ way turns out to be hard, and a new integrated methodology for carving out the required data is needed (Masini and Mattiola, 2019).

1.1 A Three-Level Methodology

The ListTyp database embodies this new methodology, which consists of three levels complementing each other (and running partially in parallel), encompassing both horizontal and vertical dimensions of investigation.

Firstly, a traditional large-scale examination of descriptive grammars is pivotal. For this first level (Level 1: horizontal), a ‘variety sample’ (Miestamo et al., 2016) represents the best option.¹ This sample should be as large as possible (ideally 400-500 languages) to let the widest variety emerge.
To this end, we have specifically created a sample of 424 languages (including isolate languages, pidgins/creoles and sign languages), following the Diversity Value technique with Ethnologue’s 2018² genetic classification, which has proven to be the most reliable (Miestamo et al., 2016). Descriptive grammars for these languages were selected according to criteria such as: (i) exhaustivity (in terms of contents); (ii) searchability (digital edition); (iii) presence of (possibly glossed) texts; (iv) recentness. In order to facilitate the (time-consuming) process of data gathering, we subsequently created, from this larger sample, a smaller sample of 223 languages (with its own internal cohesion, based on the same ‘variety’ principles), which is what we are currently using to populate the database (cf. Mattiola (2020) for more details). Level 1 aims at achieving a preliminary survey of how languages work, but it merely scratches the surface: the general ‘imperfections’ of large-scale typology are made worse by the ‘untraditional category’ status of LPs, thus calling for other layers of investigation.

Secondly, a qualitative analysis of corpora and texts (e.g. texts at the end of descriptive grammars, free corpora, corpora made available by fieldworkers, etc.) is particularly useful to detect naturally occurring lists that are hard to find in the descriptive grammars used for Level 1. Needless to say, corpora of spoken language are especially useful for our current purposes. For this second level (Level 2: intermediate), the (convenience) sample is necessarily much smaller (ideally 20-30 languages). Level 2 maximizes the possibility of finding discourse-level data (not necessarily described within the grammar) and makes it possible to overcome the problems of ‘traditional’ typology by verifying, directly in an (albeit small) corpus, data that the horizontal level did not bring out.

The third level, connected to the second, consists in a more quantitatively oriented analysis of larger (possibly annotated) corpora of a few (2-5) selected languages, which would provide enough data to draw some generalizations. Corpora might be either manually scrutinized (entirely or partially) or searched automatically through specific queries (depending on corpus annotation and size). The outputs of automatic searches are subsequently processed and checked manually. This level (Level 3: vertical) represents language-specific investigations that allow lists to be studied in much greater detail and that can detect properties and constructions that more traditional methods might not be able to bring to light, as well as similarities between ‘distant’ languages.

The idea behind this three-level methodology is that combining data from different sources and extraction techniques not only enriches our database with new occurrences, but also helps to unveil new patterns and to spot previously unexpected cross-linguistic correspondences. We believe that the very same methodology might be fruitfully applied to the typological investigation of other linguistic phenomena. At a more advanced stage of the project, we will also consider crowdsourcing as a collection technique, especially for underrepresented languages.

¹ A variety sample does not represent a balanced picture of the world’s languages. Rather, it captures the broadest possible variation in order to maximize linguistic diversity.
² https://www.ethnologue.com/

2 ListTyp Contents

ListTyp is an ongoing project: at present, the database is still only partially populated (1685 examples of LPs from 156 languages), although its architecture is complete and freely available online: https://listtyp.it/. The database is made of three main datasets (Dataset A, Dataset B, Dataset C) plus a supplement (Dataset D), each of which is partially independent, although together they make up the whole resource.
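The overall figure of 1685 examples can be cross-checked against the per-dataset counts reported in Subsections 2.1-2.4 (769, 72, 661 and 183). The following is a minimal sketch of that arithmetic; the variable and function names are illustrative only and are not part of ListTyp.

```python
# Per-dataset example counts as reported in Subsections 2.1-2.4 of the paper.
# The dictionary layout and all names here are illustrative assumptions.
EXAMPLES_PER_DATASET = {
    "A": 769,  # Level 1: descriptive grammars
    "B": 72,   # Level 2: small glossed texts
    "C": 661,  # Level 3: larger corpora
    "D": 183,  # supplement
}

REPORTED_TOTAL = 1685  # overall figure given in Section 2


def total_examples(counts: dict[str, int]) -> int:
    """Sum the example counts over all datasets."""
    return sum(counts.values())


if __name__ == "__main__":
    total = total_examples(EXAMPLES_PER_DATASET)
    print(total, total == REPORTED_TOTAL)
```

The four counts add up exactly to the reported total, so the overall figure covers the supplement (Dataset D) as well as the three main datasets.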
Searches may be run on a single dataset or on the whole database. Datasets A, B and C coincide with the three levels described in Subsection 1.1. They share the same architecture in terms of annotated properties and search criteria. However, they were gathered following (partially) different methodologies, which resulted in (partially) different sets of data that are not directly comparable.

2.1 Dataset A

Dataset A is the result of Level 1 in our methodology, based on a large sample of typologically different languages. Hence, it represents the most ‘typological’ part of our database. Dataset A is being populated following the 223-language sample mentioned in Subsection 1.1 and currently contains 769 examples of LPs belonging to 152 languages. See the following example from Atayal: musa’ magaN qsinuw, ini’ ga’ piku’ ru’ ini’ ga’ bzwaq ru’ ini’ ga’ yapit ga’ lit. ACT-go ACT-take animal NEG GA’ squirrel and NEG GA’ wild-pig and NEG GA’ flying-squirrel GA’ ‘(He) went to hunt animals: either squirrels, or wild pigs, or flying squirrels’ (cf. Rau (1992), p. 188).

2.2 Dataset B

Dataset B is the result of Level 2 in our methodology, based on a much smaller sample of typologically different languages, which are analyzed through small-size (glossed) texts. The sample for this dataset is still undefined and is being built incrementally on the basis of availability. Languages to be included in Dataset B preferably, though not necessarily, differ from those included in Dataset A. At present, Dataset B contains 72 examples of LPs from one language (Napoletano-Calabrese, Cilentan variety), extracted from a spoken corpus (e.g. era tandu bella e tandu bella ‘(She) was so nice and so nice’).

2.3 Dataset C

Dataset C is the result of Level 3 in our methodology, based on few languages, which are however analyzed in a more thorough way using larger corpora. At present, Dataset C contains 661 occurrences from one language (Italian), taken from the spoken corpus LIP (De Mauro et al., 1993) (e.g. è lui che organizza l’estorsioni le rapine i sequestri eccetera eccetera ‘He is the one who organizes extortion, robberies, kidnappings etcetera etcetera’). Further data from (spoken and written) Italian are being processed for inclusion in the database.

2.4 Dataset D: Supplement

The addition of a fourth dataset was necessary to document sparse examples collected in various ways by the ListTyp team and their students or other colleagues connected to the project. This supplement was therefore created without following any specific criterion, with the sole objective of enriching the resource. At present, Dataset D contains 183 lists (from written Italian, Russian and Spanish) connected to the COVID pandemic and manually gathered from Facebook (e.g. No se van a controlar fiestas reuniones bares discotecas aforos ‘No control of parties, meetings, bars, discotheques, capacity will be carried out’).

3 ListTyp Design

ListTyp is a web-based relational database containing a large number of parameters. Data, extracted with the different methods described in Subsection 1.1, were manually annotated by data collectors (whose contribution is acknowledged on the database website) under the supervision of the project directors.

3.1 Parameters

The main parameters, to be visualized on the ‘Examples’ webpage as a grid, include:

• Language: the name of the language according to Ethnologue (e.g. ‘Tamasheq’).
• Source: the type of source the example comes from (descriptive grammar, corpus, elicitation, web, social network, etc.).
• Example: the example as it appears in the original source (with no adjustments).
• Glosses: if the example was glossed in the original source, the original glosses are provided (with no adjustments, in most cases), otherwise they are added (in English) by the data collector.
• Translation: if the example was translated in the original source, the original translation is provided (with no adjustments),³ otherwise it is added (in English) by the data collector.
• Schema: the abstract structural skeleton of the example (e.g. the schema for the example lifts, or rather elevators would be ‘X or Y’).
• Construction: the grammatical phenomenon to which the example can be traced back, based on the commentary provided by the grammarian or the intuition of the fieldworker or data collector, despite the proliferation of terms this may entail. At present, ListTyp counts 13 values for this parameter,⁴ although the vast majority of examples are annotated as Coordination, Juxtaposition and Reduplication/repetition.
• Function: the function conveyed by the example based, again, on the commentary/translation provided by the grammarian or the intuition of the fieldworker or data collector. Here the proliferation of values is even more marked than for the ‘Construction’ parameter, as is to be expected. At present, ListTyp counts 34 tags for this parameter,⁵ some of which are declared uncertain cases (like ‘Plural / intensifying’), although there is a clear predominance of some functions like Additive and Alternative, but also Pluractional and Intensifying.⁶

³ Translations are mostly in English but also in other languages like French or Spanish.
⁴ The values are: Alternative interrogatives; Co-compounding; Complex compounding; Compounding; Contrastive marker; Coordination; Coordination/list; Juxtaposition; List; Partial repetition list; Reduplication/repetition; Reformulation; Self-repair.
⁵ The values are: Additive; Additive / sequentiality; Adverbialization; Alternative; Alternative / approximating; Antipassive; Approximating; Attenuative; Categorizing; Clarification; Collective; Contrastive; Contrastive focus; Diminutive; Distributive; Emphasis; Endearment; Enumeration; Generalizing; Intensifying; Intensifying / pluractional; Nominalization; Non-prototypicality / plurality; Pluractional; Plural; Plural / intensifying; Politeness; Predicative; Reciprocal; Reformulation; Related variety; Self-repair; Skepticism; Stylistic effect; Word formation.
⁶ Both the ‘Construction’ and the ‘Function’ parameters and their values will be subject to reflection at a later stage of the project.

By using the advanced search, other parameters are searchable, divided into three main groups of information: (i) Language info; (ii) Metadata; (iii) Formal and functional properties.

Information under Language info includes:

• Iso Code 639 3: the ISO 639-3 code for the representation of names of languages.
• Macro Area: ‘Africa’, ‘Australia’, ‘Australia & New Guinea’, ‘Eurasia’, ‘North America’, ‘South America’.
• Family / Genus / Sub Classification: following Ethnologue’s genealogical classification.

Information under Metadata includes:

• Reference: the source (grammar, corpus, etc.) from which the example was taken.
• Page: the page or other reference, depending on the type of source, from which the example was taken.
• Collector: the person(s) responsible for (finding and/or uploading) the example.
• Other Examples: similar examples to be found in the same grammar (for the time being, only one example per type of structure is included in Dataset A).

Information under Formal and functional properties (taken and adapted from Masini et al. 2018, to which we refer for details) includes:

• Syndesis: presence of connectives (‘yes’) (e.g. Kuot U-rau, nəmo bun me-nəmu-a ga me-o lit. 3mS-be.afraid COMPL APPR 3pS-kill-3mO and 3pS-eat.3sO ‘He was afraid lest they kill and eat him’, cf. Lindström (2002), p. 11) or absence of connectives (‘no’) (e.g. Lijili Ziriji kè, móotòo kè, ńjìn kè lit. train here-is, motor here-is, engine here-is ‘There are trains and cars and engines’; cf. Stofberg (1978), p. 104).
• Type Of Syndesis: ‘conjunctive’ (cf. the Kuot example), ‘disjunctive’ (e.g. Yaul Kawana mï mïnda o utam ama-p lit. [name] 3SG banana or yam eat-PRF ‘Kawana ate either a banana or a yam’, Barlow (2018), p. 303) or ‘adversative’ (e.g. Madura Hanina ngenom kopi tape banne teh lit. Hanina AV.drink coffee but not tea ‘Hanina drinks coffee but not tea’, cf. Davies (2010), p. 339).
• Prosodic Marking: presence (‘yes’) or absence (‘no’) of prosodic marking (this field largely depends on the kind of source used and on the possibility of performing a prosodic analysis on the datum).
• Type Of Prosodic Marking: if present (open field).
• Number Of Conjuncts: the number of items that make up the LP example (‘2’, ‘3’, ‘4’, etc., up to very complex examples, like this one from Italian, found in the LIP corpus (Dataset C): RAIDUE o RAITRE o Canale cinque o Montecarlo Teleroma Gbr o Videomusic Retequattro chi piu’ ne ha piu’ ne vede ‘RAIDUE or RAITRE or Canale Cinque or Montecarlo Teleroma Gbr or Videomusic Retequattro whoever has more sees more’).
• Complexity Of Conjuncts: ‘Word’, ‘Phrase’, ‘Sentence’.
• Category: ‘Nouns’, ‘Verbs’, ‘Adjectives’, ‘Adverbs’, ‘Numerals’, etc. See for instance, in Gooniyandi, a case of reduplication of verbs (doog ‘tap’ > doogdoog ‘tap repeatedly’, cf. McGregor (1990), p. 83) vs. a case of reduplication of nouns (barndanyi ‘old woman’ > barndanyibarndanyi ‘old women’, cf. McGregor (1990), p. 237).
• Presence Of Determiners: ‘yes’ or ‘no’ (when the ‘Category’ is tagged as ‘Nouns’).
• Dialogic: ‘yes’ or ‘no’ (referring to the fact that lists may be dialogically co-constructed by speakers in interaction).
• Interruption: ‘yes’ or ‘no’ (referring to the fact that lists may be interrupted by, e.g., discourse markers or hesitations in interaction).
• Type Of Interruption: if present (open field).
• Presence Of General Extender: ‘yes’ or ‘no’ (general extenders being elements like ‘and stuff like that’, ‘and so on’, ‘etcetera’ found at the end of a list, cf. Overstreet (2005)). See for instance Daga ogi guep eragi kerip iravi lit. banana loin/cloth mat betel/nut all ‘banana, loin cloth, mat, and betel nut, all (of them)’ (Murane (1974), p. 94) or Napoletano-Calabrese (Cilentan variety) add’a ballà tutto ’u tribbunale // sègge // tavuli // tuttu còse! lit. have.PRS.3SG COMPL dance.INF all DET court chairs tables all things ‘It has to dance all the court: chairs, tables, all the things’ (from Dataset B).
• Type Of General Extender: if present (open field).
• Presence Of List Surroundings: ‘yes’ or ‘no’ (list surroundings being elements connected to the LP that occur in its immediate context).
• Type Of List Surroundings: the values are ‘projection component’ or ‘post-detailing component’ (cf. Selting (2007)). In addition, the specific expression may optionally be added between square brackets. See e.g. this Italian example taken from the LIP corpus (Dataset C): la seconda guerra mondiale e’ [...] una guerra con armi piu’ sofisticate bombe cioe’ una guerra proprio di distruzione ‘World War II it’s [...] a war with more sophisticated weapons bombs that is a war of destruction’, where cioe’ una guerra proprio di distruzione ‘that is a war of destruction’ is a post-detailing component.
• Compositional: ‘yes’ or ‘no’ (referring to the fact that lists may have different degrees of compositionality, i.e. a more or less literal/exhaustive interpretation, which we had to reduce to a binary value for simplicity). Reduplication examples like Lavukaleve lafa ‘place’ > lafalafa ‘every place’ (Terrill (2003), p. 36) or compounds like Kewa, East no’go-naaki lit. girl-boy ‘children’ (Yarapea (2006), p. 169) are clear cases of non-compositional LPs, although non-literal, non-exhaustive lists are common in syntax too.
• Natural Vs Accidental Coordination: the possible values are ‘natural’ (marking that the conjuncts of the LP are lexico-semantically related, as in Havasupai-Walapai-Yavapai had(a)-ch bos(a)-m day-k-yu lit. dog-SUBJ cat-with 3=play=pl-ss-aux ‘A dog and a cat are playing (together)’; cf. Watahomigie et al. (1982), p. 55) and ‘accidental’ (not lexico-semantically related, as in Gooniyandi dawoonggoowaangginmiyi jaji maa-mi ngaaddi-mi lit. you:two:like:it what meat-IND stone-IND ‘Do you two want meat or money?’, cf. McGregor (1990), p. 286), largely as intended by Wälchli (2005).
• Semantic Relation Between Conjuncts: the possible values are either the lexico-semantic relation between the conjuncts (‘Synonyms’, ‘Co-hyponyms’, ‘Antonyms’, etc.; plus ‘Near-identical’ / ‘Identical’) or the fact that they are ‘Frame-related’ or ‘Unrelated’.

Some fields may contain a double slash (//), which means that the field was deemed either irrelevant (‘does not apply’) or uncertain (‘to be checked’).

3.2 Search Options and Functions

Each of the parameters presented in Subsection 3.1 can be searched alone or in combination with other parameters. A specific set of filters can be saved and re-applied. The same holds for specific grid sorts. When performing a search, all valid hits appear in a tabular grid on the ‘Examples’ webpage.

3.3 Data Visualization

Data resulting from a query are visualized as text (relevant languages may be visualized on a map). The ‘Examples’ webpage shows the main parameters only, whereas the rest of the parameters are available through the ‘Advanced search’ interface. However, a function is available to personalize the main grid configuration in terms of page size, default filter criteria, default sort criteria, and order and display of grid columns.
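To make the combination of search parameters described in Subsection 3.2 more concrete, the following is a minimal in-memory sketch. It is not the ListTyp implementation, which is a web-based relational database whose internals are not described here: the class, the function and the field names are hypothetical. The four records reuse the Syndesis examples cited in Subsection 3.1; the ‘yes’ value for Yaul and Madura is inferred from the presence of a connective, and the use of ‘//’ for the Lijili record is an assumption based on the double-slash convention.

```python
from dataclasses import dataclass

# Double slash: the field does not apply or is still to be checked (Subsection 3.1).
NOT_APPLICABLE = "//"


@dataclass(frozen=True)
class ListExample:
    """Hypothetical, heavily reduced record; ListTyp stores many more parameters."""
    language: str          # language name according to Ethnologue
    reference: str         # source of the example
    syndesis: str          # 'yes' or 'no'
    type_of_syndesis: str  # 'conjunctive', 'disjunctive', 'adversative' or '//'


# Records built from the examples cited in the paper (annotations partly inferred).
EXAMPLES = [
    ListExample("Kuot", "Lindström (2002)", "yes", "conjunctive"),
    ListExample("Lijili", "Stofberg (1978)", "no", NOT_APPLICABLE),
    ListExample("Yaul", "Barlow (2018)", "yes", "disjunctive"),
    ListExample("Madura", "Davies (2010)", "yes", "adversative"),
]


def search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        record for record in records
        if all(getattr(record, name) == value for name, value in criteria.items())
    ]


# A 'saved' filter is modelled here as a reusable mapping of parameter to value.
saved_filter = {"syndesis": "yes", "type_of_syndesis": "disjunctive"}

if __name__ == "__main__":
    for hit in search(EXAMPLES, **saved_filter):
        print(hit.language, hit.reference)
```

Passing a single keyword corresponds to searching one parameter alone, while passing several narrows the hits to records that satisfy all of them. Keeping the filter as a plain mapping mirrors the idea that a set of filters can be saved and re-applied.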
Each example in the database can be visualized in three ways (see the Appendix): (i) as a line on the tabular grid, where each column corresponds to one of the main parameters (or the parameters customized and set by the user); (ii) as a ‘traditional’ horizontal example with interlinear morphemic glosses (which shows up on request right below each line in the column grid); (iii) as a separate full-page ‘card’ containing all the information available for that item, including main parameters, advanced search parameters, and localization map.

4 An Open Project

ListTyp is an ongoing project that welcomes collaborations for both data collection and analysis. We are currently processing data for completing Dataset A and enriching the other datasets. Updates will be published periodically. Full documentation will be available soon.

Acknowledgments

ListTyp is an outcome of universaLIST – List constructions in typological and cognitive perspective, a 3-year project (2017-2020) funded by the Department of Modern Languages, Literatures, and Cultures (LILEC) of the University of Bologna. The project is part of the research network LIST – Listing in Natural Language led by Francesca Masini and Caterina Mauri. The search interface and web design were built by WebSoup (Lucca, Italy): https://www.websoup.it/.

References

Russell Barlow. 2018. A grammar of Ulwa. University of Hawai’i at Mānoa Doctoral Dissertation, Mānoa.
Claire Blanche-Benveniste. 1990. Un modèle d’analyse syntaxique “en grilles” pour les productions orales. Anuario de Psicología, 47:11–28.
William D. Davies. 2010. A grammar of Madurese. Mouton de Gruyter, Berlin/New York.
Tullio De Mauro, Federico Mancini, Massimo Vedovelli, and Miriam Voghera. 1993. Lessico di frequenza dell’italiano parlato. Etaslibri, Milano.
Gail Jefferson. 1990. List-construction as a task and resource. In George Psathas, editor, Interactional competence, pages 63–92. Irvington Publishers, New York.
Eva Lindström. 2002. Topics in the Grammar of Kuot. Stockholm University Doctoral Dissertation, Stockholm.
Francesca Masini and Simone Mattiola. 2019. Come fare tipologia con categorie non tradizionali? In Chiara Gianollo and Caterina Mauri, editors, CLUB Working Papers in Linguistics 3, pages 282–294. CLUB – Circolo Linguistico dell’Università di Bologna, Bologna.
Francesca Masini, Caterina Mauri, and Paola Pietrandrea. 2018. List constructions: Towards a unified account. Italian Journal of Linguistics, 30(1):49–94.
Simone Mattiola. 2020. Two language samples for maximizing linguistic variety. Alma Mater Studiorum - Università di Bologna, Bologna.
William McGregor. 1990. A Functional Grammar of Gooniyandi. John Benjamins, Amsterdam/Philadelphia.
Matti Miestamo, Dik Bakker, and Antti Arppe. 2016. Sampling for variety. Linguistic Typology, 20(2):233–296.
Edith Moravcsik. 1978. Reduplicative constructions. In Joseph Greenberg, editor, Universals of human language, volume 3: Word Structure, pages 297–334. Stanford University Press, Stanford.
Elizabeth Murane. 1974. Daga grammar: From morpheme to discourse. The Summer Institute of Linguistics and the University of Texas at Arlington, Norman.
Maryann Overstreet. 2005. And stuff und so: Investigating pragmatic expressions in English and German. Journal of Pragmatics, 37(11):1845–1864.
Der-Hwa Victoria Rau. 1992. A Grammar of Atayal. UMI [Cornell University Doctoral Dissertation], Ann Arbor.
Margret Selting. 2007. Lists as embedded structures and the prosody of list construction as an interactional resource. Journal of Pragmatics, 39(3):483–526.
Yvonne F. Stofberg. 1978. Migili grammar. The Summer Institute of Linguistics, Dallas.
Angela Terrill. 2003. A Grammar of Lavukaleve. Mouton de Gruyter, Berlin/New York.
Bernhard Wälchli. 2005. Co-compounds and natural coordination. Oxford University Press, New York.
Lucille J. Watahomigie, Jorigine Bender, and Akira Y. Yamamoto. 1982. Hualapai reference grammar. American Indian Studies Center, UCLA, Los Angeles.
Apoi Mason Yarapea. 2006. Morphosyntax of Kewapi. Australian National University Doctoral Dissertation, Canberra.

Appendix: Visualizations for Example 269

[Figures: (i) Tabular grid; (ii) Horizontal; (iii) Full-page ‘card’, available at: https://listtyp.it/row/view?id=269]