
Bayesian Phylolinguistics

2020, The Handbook of Historical Linguistics, Volume II

AI-generated Abstract

This research emphasizes the utility of Bayesian phylogenetic methods in historical linguistics, particularly utilizing large datasets to construct language family trees. It contrasts traditional qualitative approaches with Bayesian methods, highlighting their advantages in addressing combinatorial complexities and providing quantitative estimates of divergence times and subgrouping uncertainties. The study serves as a guide for incorporating Bayesian techniques into linguistic research.

BAYESIAN PHYLOLINGUISTICS

Simon J. Greenhill (1,2), Paul Heggarty (1) and Russell D. Gray (1,2,3)

(1) Department of Linguistic and Cultural Evolution, Max Planck Institute for the Science of Human History, 07745 Jena, Germany.
(2) ARC Centre of Excellence for the Dynamics of Language, Australian National University, Canberra, ACT 0200, Australia.
(3) School of Psychology, University of Auckland, Auckland 1142, New Zealand.

1. INTRODUCTION

Change is coming to historical linguistics. Big, or at least “bigish data” (Gray and Watts 2017), are now becoming increasingly available in the form of large web-accessible lexical, typological and phonological databases (e.g. ABVD (Greenhill et al 2008), Chirila (Bowern 2016), Phoible (Moran 2014), WALS (Haspelmath 2014), Autotyp (Bickel et al 2017) and the soon-to-be-released Lexibank, Grambank, Parabank and Numeralbank; see http://www.shh.mpg.de/180672/glottobank). This deluge of data is far beyond the ability of any one person to process accurately in their head. The deluge will thus inevitably drive the demand for appropriate computational tools to process and analyze the vast wealth of freely available linguistic information. In this chapter we will briefly describe one such set of computational tools, Bayesian phylogenetic methods, and outline their utility for historical linguistics. We will focus on four main questions: what is Bayesian phylolinguistics, why does this approach typically focus on lexical data, how is it able to estimate divergence dates, and how reliable are the results?

2. WHAT IS BAYESIAN PHYLOGENETICS?

There is a proud tradition of building language family trees in historical linguistics that was popularized by August Schleicher in the nineteenth century (Atkinson and Gray 2005), but dates back to at least the seventeenth century (List et al 2016). These language family trees are typically constructed in a qualitative manner from innovations in lexical, phonological, and morphological characters.
However, although some kind of optimization procedure is implicit in the comparative method, traditional historical linguists do not use an explicit optimality criterion to select the best tree, nor do they use an efficient computer algorithm to search for it. This is surprising given that the task of finding the best tree, or set of trees, is inherently a combinatorial optimization problem of considerable computational difficulty. For just five languages there are 105 ways of subgrouping them in a rooted bifurcating tree. For 10 languages this number grows to over 34,000,000, and for 100 languages the number of possible trees is greater than the number of atoms in the universe (Felsenstein 1979). Traditional language family trees suffer from two additional limitations: their branching patterns reflect only a relative chronology, and they contain no information about the uncertainty of any proposed subgroupings.

Bayesian phylogenetic methods provide a useful supplement to the comparative method. They enable us to build trees with explicit estimates of both branch lengths and subgrouping uncertainty in an objective, repeatable manner. These trees can then be used to evaluate subgrouping hypotheses (Gray et al 2009, Greenhill and Gray 2012), date language divergences (Gray et al 2009, Bouckaert et al 2012), estimate ancestral states, test hypotheses of functional dependencies in linguistic features (Dunn et al 2011), and infer geographic homelands and migration routes (Bouckaert et al 2012, Grollemund et al 2016).

So what exactly are Bayesian phylogenetic methods? Bayesian phylogenetic methods use Bayes' theorem to make probabilistic inferences about phylogenetic trees and their model parameters. The approach calculates a posterior probability distribution of trees, P(A|B), as a function of the prior probability of a tree, P(A), and the likelihood of the data (B) given the tree and the model of character change.
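The tree counts quoted in this section can be checked directly. The sketch below assumes the standard combinatorial result that the number of rooted bifurcating trees on n labelled tips is the double factorial (2n - 3)!!; the function name is illustrative and not taken from any phylogenetics package, and the figure of roughly 10^80 atoms in the observable universe is a commonly cited order of magnitude, used here only as a yardstick.

```python
def rooted_bifurcating_trees(n_languages: int) -> int:
    """Number of distinct rooted bifurcating trees on n labelled tips.

    Uses the double factorial (2n - 3)!! = 3 * 5 * ... * (2n - 3).
    For n = 1 or n = 2 the product is empty and the count is 1.
    """
    if n_languages < 1:
        raise ValueError("need at least one language")
    count = 1
    for factor in range(3, 2 * n_languages - 2, 2):
        count *= factor
    return count


if __name__ == "__main__":
    for n in (5, 10, 100):
        trees = rooted_bifurcating_trees(n)
        # Print the number of decimal digits for the very large counts.
        print(n, "languages:", len(str(trees)), "digits")
```

Because the count grows super-exponentially, exhaustive enumeration is out of the question beyond a handful of languages, which is why a sampling-based search is needed.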
In the analysis of DNA sequences the model specifies the number of parameters to be estimated for the rate or rates of nucleotide substitution. These models range from simple ones with just one rate (the Jukes-Cantor model) to more complex ones in which different types of substitutions have different rates (e.g. the General Time Reversible model, where there are six rate parameters to be estimated). The model can also include complexities such as between-site rate variation, where different nucleotide sites are fitted to a distribution of rates (e.g. a gamma distribution).

MCMC (Markov Chain Monte Carlo) methods are typically used to estimate the posterior distribution. In this procedure the search through tree and parameter space starts with a random tree and arbitrary values for the model parameters and branch lengths. A likelihood score is calculated for this tree and set of parameters. Then a new tree and model parameters are proposed. If the likelihood score is better, this proposal is accepted. If the likelihood is worse, then the proposal is accepted with a probability determined by the ratio of the likelihood of the proposed tree to that of the existing one. This process is continued for a very large number of steps (generations). The MCMC chain explores tree and parameter space, incrementally finding the set of trees that best fit the data given the model. The tree topology, branch lengths and model parameters are saved at intervals so that the progress of the chain can be assessed. After an initial “burn-in” period the MCMC chain should end up converging on a region where the trees are sampled in proportion to their posterior probability. In practice, it is sensible to use multiple runs with different random starting trees to check that the analysis has converged on this region. For a recent review of Bayesian phylogenetics in biology that includes a good discussion of some of the practical issues with MCMC searches see Nascimento et al (2017).
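The accept/reject rule just described can be illustrated with a deliberately tiny example. The sketch below is not a phylogenetic sampler: as a simplifying assumption it replaces the tree with a single parameter p, the probability that a cognate is retained, and uses a flat prior on (0, 1), so that the posterior ratio reduces to the likelihood ratio mentioned above. All names, the data (30 retentions out of 100) and the tuning values are hypothetical.

```python
import math
import random


def log_likelihood(p: float, retained: int, total: int) -> float:
    """Log-likelihood of `retained` successes in `total` independent trials."""
    return retained * math.log(p) + (total - retained) * math.log(1.0 - p)


def metropolis(retained, total, steps=50_000, burn_in=5_000, thin=10,
               step_size=0.1, seed=1):
    """Toy Metropolis sampler for one parameter under a flat prior on (0, 1)."""
    rng = random.Random(seed)
    p = rng.uniform(0.05, 0.95)              # arbitrary starting value
    current = log_likelihood(p, retained, total)
    samples = []
    for step in range(steps):
        proposal = p + rng.uniform(-step_size, step_size)
        if 0.0 < proposal < 1.0:             # outside the prior's support: reject
            proposed = log_likelihood(proposal, retained, total)
            # Better proposals are always accepted; worse ones are accepted
            # with probability equal to the likelihood ratio.
            u = 1.0 - rng.random()           # uniform on (0, 1], avoids log(0)
            if math.log(u) < proposed - current:
                p, current = proposal, proposed
        # Discard the burn-in, then keep every `thin`-th state of the chain.
        if step >= burn_in and (step - burn_in) % thin == 0:
            samples.append(p)
    return samples


if __name__ == "__main__":
    draws = metropolis(retained=30, total=100)
    print("posterior mean of p:", sum(draws) / len(draws))
```

In a real analysis each state of the chain would be a tree plus branch lengths and model parameters, and several chains started from different seeds would be compared to check convergence.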
A key feature of Bayesian phylogenetic inference is that the result of the analysis is not a single tree but rather a set of trees and their model parameters sampled in proportion to their posterior probability. This set of trees is often summarized in a consensus tree. Consensus trees are normally depicted with the posterior probability of a clade (subgroup) shown on the branch. This number is the percentage of trees in the posterior distribution that contain that clade. Thus we end up not just with a single optimal tree, but rather with a set of trees that lets us evaluate the extent to which the data support a particular inference (given the model and the prior).

Figure 1 shows how this approach can be adapted for linguistic inferences. First, some data must be selected. To date this has mainly been basic vocabulary. The rationale for this choice will be discussed in the next section. At this point we will simply note that in principle other kinds of linguistic data can be used in phylogenetic analyses, and have been (see Greenhill et al 2010, Greenhill et al 2017), and that combined analyses are also possible. Given that we have lexical data, the next and in many ways most crucial and time-consuming step involves coding the data for cognacy. Here the comparative method is king. We want real cognates, not mere “look-alikes”. While considerable progress has been made on automating cognate coding, with up to 89% accuracy, these procedures are best viewed as assisting rather than replacing the linguist (List et al 2017).

Once the lexical data have been coded for cognacy, the cognate sets can be converted into a matrix suitable for phylogenetic analyses (Step 2). The matrix might be either multistate or binary, depending on the model of character change that is going to be selected. In the multistate matrix each semantic slot would be a character (column), with the value for each cell reflecting a cognate class.
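The two codings in Step 2, the multistate matrix just described and the binary alternative in which each cognate set gets its own presence/absence column, can be sketched as follows. This is a minimal illustration under stated assumptions: the language names, meanings and cognate class labels are invented placeholders, not real cognacy judgements, the function names are hypothetical, and missing data are marked with "?" as one common convention.

```python
# Hypothetical cognate judgements: meaning -> {language: cognate class}.
# LangD has no recorded form for "eye", so that cell is missing.
COGNATES = {
    "eye": {"LangA": 1, "LangB": 1, "LangC": 2},
    "hand": {"LangA": 1, "LangB": 2, "LangC": 2, "LangD": 3},
}
LANGUAGES = ["LangA", "LangB", "LangC", "LangD"]


def multistate_matrix(cognates, languages):
    """One column per meaning; each cell holds the cognate class or '?'."""
    meanings = sorted(cognates)
    rows = {
        lang: [str(cognates[m].get(lang, "?")) for m in meanings]
        for lang in languages
    }
    return meanings, rows


def binary_matrix(cognates, languages):
    """One column per cognate set; cells are '1' (present), '0' (absent) or '?'."""
    columns = [
        (meaning, cls)
        for meaning in sorted(cognates)
        for cls in sorted(set(cognates[meaning].values()))
    ]
    rows = {}
    for lang in languages:
        cells = []
        for meaning, cls in columns:
            state = cognates[meaning].get(lang)
            if state is None:
                cells.append("?")            # no form recorded for this meaning
            else:
                cells.append("1" if state == cls else "0")
        rows[lang] = "".join(cells)
    return columns, rows


if __name__ == "__main__":
    print(multistate_matrix(COGNATES, LANGUAGES))
    print(binary_matrix(COGNATES, LANGUAGES))
```

Note that the cells encode cognate class membership only; the phonological forms themselves never enter the matrix.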
In the binary matrix each cognate set has its own column, and the value in the cell reflects the presence or absence of that cognate in the language (see figure 1).

Step 3 involves the specification of the prior. This can include information on the likely distribution of trees, model parameters, branch lengths and the like. One advantage of the Bayesian approach is that information from other sources can be included in the prior to aid the best inference. For example, the age of ancient manuscripts or the timing of known historical events can be used to calibrate the trees to help date divergence times (see section 4). Inferences from other kinds of data (e.g. phonological innovations) could also be included in the prior to aid subgrouping inferences.

Step 4 entails selecting a model of character change. Models can range from the simple binary single-rate model with a strict clock, to models with rate heterogeneity (Yang 1993) and a relaxed clock. There is a tradeoff in model selection. If the model is too simple, the inferences may be inaccurate. If it is too complex (over-parameterised), then the excess of parameters to be estimated will inflate the variance associated with each estimate and thus limit the power of the analysis (Burnham and Anderson 1998). In careful Bayesian analyses the performance of different models should be evaluated using Bayes factors to determine the best model. In practice we have found that the covarion model (Penny et al 2001), where sites can switch between fast and slow rates on different parts of the tree, often outperforms simpler models such as the single-rate or Dollo (single cognate gain) models. Similarly, and this will be of no surprise to critics of glottochronology, the relaxed clock model outperforms the strict clock.

Step 5 is the MCMC search through tree and parameter space.
Because each step in the Markov chain is necessarily strongly correlated with the previous step, samples from the chain need to be taken many generations apart (e.g. 1000 or even 10,000 generations). The chain also needs to be run long enough to get well past “burn-in”, and multiple runs with different random seeds are needed to check for convergence.

The final step in the Bayesian analysis is to summarize the tree topology and the key parameters of interest such as divergence dates. While consensus trees are often used to summarize the inferences about the tree topology, it is our experience that linguists often ignore the posterior probabilities on the branches and treat all branches in the consensus tree equally. This is a bad mistake. The real result of the analysis is not a single tree but the posterior set of trees. A key feature of Bayesian phylolinguistics is that this set reveals the strength of support in the data for various subgroupings (given the prior and the model). A useful way to visualize this uncertainty is to plot all the trees from the posterior distribution on top of each other in a “densitree” (Bouckaert 2010). Figure 2 shows a densitree for Central Pacific languages. The tree was constructed from cognate-coded basic vocabulary from the Austronesian Basic Vocabulary Database (Greenhill et al 2008). A binary covarion model was used in the analysis. The densitree shows that some subgroups such as Eastern Polynesian are reliably recovered. It also reveals considerable uncertainty in the placement of many languages, perhaps as a consequence of the conflicting signal caused by dialect networks (Gray et al 2010).

3. WHY USE “THE LEXICON”?

Historical linguists do not generally see the lexicon as their data-set of choice. The lexicon is the level of language most exposed to borrowing, and it is also not immune to chance resemblances that may appear to go back to a common original form, but do not (e.g. Spanish mucho and English much).
The comparative method insists instead on other types of language data, deemed much more conclusive of language relationships: form-to-meaning correspondences across extensive morphological paradigms (e.g. the parallels first identified between the case endings in Sanskrit, Greek and Latin); and repeated, consistent sound correspondences, linked by regular and natural ‘laws’ of sound change between them. So why have most Bayesian phylogenetic analyses nonetheless drawn their comparative language data from the lexicon? And can the results be trusted?

To answer these questions we first need to clarify what exactly is meant by ‘lexical data’. We also need to break this down into three separate issues, about the suitability of the lexicon for different purposes: (1) to establish whether languages are related within a family; (2) to recover the phylogeny of an already established family; and (3) to research a family’s chronology.

3.1 The Lexicon for Establishing Relatedness?

Swadesh first drew up his famous lists to target meanings that he assumed were the most stable and resistant to borrowing. In the early heyday of lexicostatistics, this morphed into an optimistic assumption that even a minimal level of apparent matches between the Swadesh lists for two languages could be presumed to be deep shared cognates rather than loanwords, and thereby demonstrated a deep ‘stock’-level relationship (see Heggarty 2010: 307-309). Even the core lexicon is by no means entirely borrowing-free, however, so claims to establish relatedness from lexical lists alone have never convinced. Basic vocabulary is still used in exploratory studies, as a fertile hunting ground where deep cognates are most likely to survive. But that has to be followed up by the orthodox comparative method, to either confirm or dismiss whether any apparent matches reflect a relationship of common descent (cognacy), or not.
Likewise, a Bayesian phylogeny or ‘family tree’ is meaningful only for languages that do actually form a family with each other, and only when based on data that are probative of phylogenetic relationships. Such data lie not in lexis per se, but in cognate status, which can only be established by the comparative method. So Bayesian phylogenetic methods are not proposed here as a method to try to establish whether languages are related in the first place. On the contrary, they rely on the comparative method to establish that in advance, so as to create the input data they need: cognacy judgements. In any case, a simple lexicostatistical count of putative ‘matches’ has nothing to do with Bayesian phylogenetic analyses and their explicit modelling of descent with modification through time.

3.2 Why Use ‘The Lexicon’ for Phylogeny?

Once the comparative method has established a family, one can move on to the second task: recovering the structure of its family tree, i.e. the basic objective of phylogenetic analysis. To this end, however, the comparative method continues to privilege regular sound changes and morphology (especially paradigms, where available). It can seem questionable, then, that many Bayesian phylogenetic approaches are described as using ‘lexical data’ instead.

It is easy, however, to misconstrue and overplay the apparent contrast between the comparative method and ‘mere’ lexical data. For a start, the preference for morphological paradigms belongs only to a best-case wish-list. In reality, extensive paradigms are simply not available in all language types. With isolating languages, the comparative method has to make do without them; hence the greater difficulties and doubts surrounding the deep reconstruction of families such as Sino-Tibetan. Lexical data are universally available (even if somewhat more limited in polysynthetic languages).
There are also many under-documented and extinct languages for which lexical data are all we have, and all we ever will have.

As for sound laws, in order to establish that a sound correspondence is regular and repeated, the comparative method requires multiple tokens of that correspondence. Where are those tokens to be found? Firstly, grammatical morphemes are so few in number that most tokens are necessarily found in the lexicon instead. Secondly, tokens of regular sound changes are by definition found mostly in cognates. (Loanwords may mimic some past sound changes when phonologically adapted into the borrower language, but otherwise show only those changes that arose thereafter.) Thirdly, cognate tokens are not evenly distributed across the whole lexicon, but necessarily concentrated in that part of it least susceptible to lexical replacement (whether by loanwords or language-internal semantic shifts). So while in principle there is no special focus on basic vocabulary when searching for tokens of regular sound correspondences, in practice the comparative method finds much of its evidence (more than in any other comparable sector of the vocabulary) in exactly the same stable core of the lexicon that phylogenetic analyses use.

More importantly still, much of the confusion about, and dismissal of, the phylogenetic value in mere ‘lexical data’ arises because that term is in fact a popular misnomer. The crucial comparative linguistic datum used is not lexicon, but cognacy. The 0 and 1 (or multi-state) values that form the input to phylogenetic analyses do not directly represent lexemes at all, or their phonological forms. What the 0s and 1s represent, directly and uniquely, are relationships of cognacy, i.e. common descent. Simply drawing up the Swadesh list for a given language is not enough.
The list of lexemes is not the data, and is unusable by the method until each lexeme has been assigned to its correct cognate set vis-à-vis the lexemes in all other languages in that family. Only then is there any actual comparative data to be fed into the phylogenetic analysis because, to repeat, the input data are not lexical forms, but cognacy relationships. So to describe the data as ‘lexical’ is a misnomer particularly because it implies, by omission, no input from phonology (sound correspondences), morphology, and the comparative method in general. Yet these are, on the contrary, all inescapably implicit in the concept of cognacy itself.

The task of assigning cognacy, to create the input data, requires detailed, qualitative, historical linguistic analysis, by the comparative method. Any cognate set is defined by common descent. Specifically, it is defined and identified by the proto-form from which all its members descend directly (i.e. without borrowing, see below). The cognate set to which English rain and German Regen belong, for example, is defined by the Proto-Germanic form *regna (www.cobl.info/meaning/rain). Likewise, the fact that Greek kardía, Russian serdce and Old Irish cride are cognate is defined by the Proto-Indo-European form *ḱerd-, from which they all descend (www.cobl.info/meaning/heart). That is, to define cognate sets relies on the comparative method, its sound-change laws and reconstructions. Thousands of such individual cognate sets are what make up the phylogenetic signal that Bayesian approaches make use of.

Cognacy assignment, when properly performed, integrates and rests on all of the data, methodology and findings of orthodox comparative-historical linguistics, not least in phonology and morphology. It is only that methodology that can reliably establish which word-forms are true cognates to each other, even when sound changes have obscured original similarities.
Likewise, only the comparative method can tell apart true cognates from loanwords or from chance lookalikes, essentially because those do not exhibit the expected regular sound correspondences. The data that Bayesian phylogenetic analysis uses are best not described as just ‘lexical data’, then, but more accurately as cognacy data, for a sample list of precisely defined lexical meanings.

Finally, one should be under no illusions as to the many complexities and challenges in preparing cross-linguistic data-sets of cognacy in lexical meanings. Early data-sets such as that by Dyen, Kruskal & Black (1992) have been rightly criticized for many failings; much progress is still being made towards more rigorous and consistent policies for drawing up such databases, specifically as appropriate for the purposes of phylogenetic analysis and chronology estimation. The methodology needed is too complex to do it justice in this chapter, however, and is set out in much more detail in Heggarty et al. (in prep.). For a major language family, assigning cognacy across scores or hundreds of language varieties, ancient and modern, in up to 200 target meanings for each, is a daunting task. Even for the best-researched language families like Indo-European, it demands painstaking historical linguistic analysis by large teams of language and family experts. That all serves only to re-emphasise the fundamental point: far from being discarded, all the methodology and findings of orthodox historical linguistics are precisely what databases of cognacy in basic lexicon are founded upon in the first place.

3.3 Size Matters

There is one other crucial advantage of data in the form of cognacy in lexical meanings: there are enough of them. For Bayesian phylogenetic analysis to work well requires very significant amounts of data in order to estimate all main aspects of the results: the tree structure, branch lengths, model parameters, the corresponding time-depths, and so on.
The large quantities of data needed are only really available from cognacy in the lexicon. The experience of Ringe et al. (2002), in their (non-Bayesian) search for a “perfect phylogeny” for the Indo-European family, is instructive here. Ringe et al. start out with the historical linguist’s instinctive preference for data characters in phonology and morphology. But they find only 22 of the former, and 15 of the latter, that they consider validly usable, and retain even “fewer morphological characters” in their follow-up study (Nakhleh et al. 2005: 394). This not only makes for a very small data-set, but as Ringe et al. (2002: 98) themselves admit, “the worst news is yet to come: the vast majority of our well-behaved monomorphic characters simply define one or more of the ten uncontroversial subgroups of the family, contributing nothing to their higher-order subgrouping”. In practice, the main higher-order nodes in their output trees turn out to be supported by just two, one or even none of these phonological and morphological characters. Most of the phylogenetic information ends up being provided by their lexical cognacy characters after all, thanks to the sheer number of them.

A further consideration is that Ringe et al.’s phonological characters are effectively all just binary. Their P2 character “full ‘satem’ development of dorsals”, for example, allows values of just present or absent. Moreover, any one language cannot change from one of those states to the other more than once over the entire history of Indo-European. Each lexical meaning, however, typically corresponds to a range of many different cognacy states, effectively multiplying the amount of discriminatory data in that one meaning. For a broad family such as Indo-European, the average meaning has well over ten different cognacy states (for exact statistics per meaning see www.cobl.info/wordlist/Jena175).
And in each meaning, changes from one (cognacy) state to another can arise continuously, across the tree, throughout the family’s divergence history. A 200-meaning list thus turns into many thousands of cognacy states, and even if not all of them are informative on the higher-order branching, they all still contribute crucially to the estimation of branch lengths and time-depths. This is also why phylogenetic analyses do better to aim for data-sets of the order of Swadesh’s initial 200 meanings, rather than shorter versions slimmed down for other purposes, such as Swadesh’s own 100-meaning list or Tadmor et al.’s (2010: 238-243) Leipzig-Jakarta list, also of just 100 meanings. This need for sheer quantity of data is another reminder of the scale of the task to draw up such comparative databases.

3.4 Not Cherry-Picking, But ‘Safety in Numbers’

Large data-sets, based on a pre-determined list of meanings intended to be applied to any language family, also have the advantage of forcing objectivity. They avoid the danger of individual scholars subjectively cherry-picking data characters, not least in phonology and morphology, that may be unrepresentative and open to interpretation (including, for instance, as to which state was ancestral). It is striking that historical linguistics has still failed to come to a consensus family tree even for many of the best researched of all language families, including Romance, Germanic, and Indo-European as a whole. In part this is because of how the discipline has essentially proceeded on the disputed points: competing scholars each invoke small subsets of the language data, cherry-picked to argue for one tree structure over another, but which fail to convince while other cherry-picked subsets point to rival analyses.
Subjectively excluding other selected data, meanwhile, comes with a risk of circularity, in that particular characters may be judged unreliable and excluded precisely because they do not fit with a preconceived idea of what the tree structure should be. Ringe et al.’s (2002) data-set is heavily “screened”, one of the reasons why they end up with so few informative characters, but they are still frustrated by far more characters incompatible with any tree (at least 16) than supporting the main higher-order nodes (mostly 4 or fewer). It is not that all language data necessarily reflect phylogeny; of course not. But determining which innovations most likely arose in parallel, for instance, and on which different branches, is precisely one of the results that emerges in any case from a powerful Bayesian phylogenetic analysis, and it does so in the form of a balanced, quantitative assessment rather than a subjective and potentially circular judgement.

We shall see shortly below the obvious dangers of cherry-picking data for chronological purposes, too, when trying to second-guess how much or how little language change is ‘plausible’ within a given time-span. Avoiding such subjectivity is certainly one of the attractions of large data-sets of cognacy across a pre-determined list of lexical meanings.

3.5 Why Use the Lexicon for Dating?

Establishing the phylogeny of a language family is one thing; researching its chronology is quite another. The fact that glottochronology was directly based on lexicostatistics has given the lexicon a bad name for this purpose. The basic problem is not the lexicon itself, however, nor the concept of cognacy. Rather, the undoing of glottochronology was its unrealistically strict assumptions, especially of a universal, unvarying rate of change, and its simplistic distance-based measures, without any real model of linguistic ‘descent with modification’ through time.
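To see why a single fixed rate is so fragile, it helps to look at the arithmetic. The sketch below uses the classic textbook glottochronology formula, t = ln(c) / (2 ln(r)), where c is the proportion of shared cognates between two languages and r is the assumed retention rate per millennium. This formula is not given in the chapter and is included here only as an illustration; the function name and the input values are hypothetical.

```python
import math


def glottochronology_age(shared: float, retention: float) -> float:
    """Classic glottochronological divergence estimate, in millennia.

    shared    -- proportion of cognates shared by the two languages (0 < c <= 1)
    retention -- assumed universal retention rate per millennium (0 < r < 1)
    """
    if not 0.0 < shared <= 1.0:
        raise ValueError("shared must be in (0, 1]")
    if not 0.0 < retention < 1.0:
        raise ValueError("retention must be in (0, 1)")
    # Both lineages change independently, hence the factor of 2.
    return math.log(shared) / (2.0 * math.log(retention))


if __name__ == "__main__":
    # Same observed cognate overlap, three different assumed retention rates.
    for r in (0.80, 0.86, 0.90):
        print(f"r = {r:.2f}: {glottochronology_age(0.74, r):.2f} millennia")
```

With the observed overlap held fixed, moving the assumed retention rate from 0.80 to 0.90 more than doubles the estimated age, so the date is driven largely by an assumption that the method itself cannot test.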
Certainly, it is hardly as if historical linguists have enjoyed much success and unanimity in looking to other levels of language, beyond the lexicon, to help in estimating time-depths.

3.6 Dating by Phonetics or Morphosyntax?

In phonology and phonetics, for instance, rates of change and divergence appear to vary even more freely and spectacularly than is usual in the core lexicon. Take for example Latin [akʷam] water, [altʊm] high and [ad illʊm] to that, which in only two millennia have all attrited, in French, to just [o] (distinguished only in spelling as <eau>, <haut> and <au>), alongside other spectacular examples of change such as [kalidʊm] → [ʃo], [kaballos] → [ʃfo] and [habitʊm] → [y] (hot, horses and had). Or one can just as well cherry-pick examples of very slow sound change, even over the many millennia since Proto-Indo-European. Reconstructed *bʰréh₂t(ē)r brother, for example, still retains its initial obstruent + rhotic cluster little changed into modern Welsh [braut], Russian [brat], English [bɹʌðəɹ] and ‘even’ French [fʁɛʁ].

With such great variation in possible rates, arguing just from one-off examples leaves us open to cherry-picking instances where change has been either abnormally fast or abnormally slow, and thus unrepresentative and no proof of anything for dating purposes. Rather, it leaves us open to impressionistic and subjective expressions of conviction, which is what we need to get away from, in favour of an objective, balanced distribution of rates of change, and an explicit historical model.

A keystone in the traditional short chronology for Indo-European, for instance, is the perception that Avestan and Vedic are ‘so close’ that the divergence between them ‘must’ be but a matter of a few centuries. Sims-Williams (1998: 126), for example, offers selected sentences that “may be transposed from the one language into the other merely by observing the appropriate phonological rules”.
Much the same, though, can be said of selected phrases in Italian and Spanish, for instance: see Heggarty & Renfrew (2014: 545). Change and divergence have been minimal in no end of word pairs, such as Italian [liŋgwa] vs. Spanish [lɛŋgwa] tongue (from Latin [liŋgʷa]), [mɔndo]~[mundo] world, [t͡ʃelo]~[θjelo] sky, etc. Indeed other cases show no real divergence at all: [kanto]~[kanto] I sing, [salta]~[salta] s/he jumps, and so on. Here too, simply applying phonological rules can straightforwardly transpose one language to the other, but that hardly proves a time-depth of divergence of just a few centuries, as traditionally envisaged between Avestan and Vedic. For the net divergence between Italian and Spanish to arise, out of Latin, took the same two millennia that in French saw much more radical change.

Morphosyntax, likewise, does not seem to offer a viable data-type for language dating. Starting out from the inflectional case system of Proto-Indo-European, for instance, its modern descendants show an entire gamut of different amounts of net case loss, from total to almost nil. At one extreme, French has lost all traces of the system on nouns, while English retains only its ‘Saxon genitive’. Other languages have maintained parts of the system, albeit heavily eroded (e.g. Romanian and German). At the other extreme, meanwhile, most Slavic languages still count up to six cases. So over the same time-depth since Proto-Indo-European, different branches and languages have lost (and sometimes re-created) cases at very different rates and stages in time. Note how Bulgarian, for example, has bucked the trend of Slavic continuity and seen most of its case system collapse since the time of Proto-Slavic. It seems hard to make anything of this for chronological purposes. Take the famous case of Lithuanian, often evoked in admiration at how little some of its case-endings have changed since Proto-Indo-European.
This is of no help in deciding between the two main hypotheses on the time-depth of Indo-European: little change over six millennia hardly excludes little change over another few millennia, either.

A further consideration is that different levels of language can change at very different rates. Within Romance, French is unquestionably an outlier in how much phonetic attrition it has undergone. In lexis, however, French is unremarkable, a typical Romance language with retention and cognacy rates of exactly the same order as most of its sister languages.

Given all this variability, it is no surprise that those historical linguists who dare to make pronouncements on their ‘intuitions’ of how long a time-span is or is not plausible for a given range of divergence across a family do not necessarily agree. Many Indo-Europeanists, seduced by the claims of linguistic palaeontology (likewise subjective and disputed, see Heggarty 2014, Heggarty forthcoming), have taken as read a Steppe hypothesis time-depth of c. six millennia. That assumption has then set their benchmark for intuitions on what rates of divergence are ‘plausible’, and forced further assumptions on crucial chronological questions such as the time-depth of the Indic-Iranic split that led to Vedic and Avestan. Remove the assumptions, however, and other linguists such as Dolgopolsky (1990: 239) see the Steppe hypothesis timeframe espoused by Mallory (1989) as “utterly unrealistic”. Mallory’s “dating, which presupposes that Proto-Anatolian, Proto-Indo-Iranian, Greek and other descendant languages could have diverged from each other for a mere 2000 years, is absolutely inconceivable” for Dolgopolsky. Many more linguists would challenge other aspects of Mallory’s (1989) chronology, in which Proto-Germanic, for instance, as recently as 2500 years ago, is “a late Indo-European dialect”.
3.7 Why Linguistic Dating Should Be Possible

Whatever levels of language one looks to, then, cherry-picked examples and impressionistic pronouncements from ‘intuition’ lead us to a dead-end, a stand-off between opposing subjective positions. So it is perhaps understandable that some linguists steer clear of any idea of dating from language. “Linguists don’t do dates”, as McMahon & McMahon (2006) put it, while for Dixon (1997: 47, 49) “What has always filled me with wonder is the assurance with which many historical linguists assign a date to their reconstructed proto-language… How do they know? … It does seem to be a house of cards”.

Such pessimism, however, actually loses sight of some basic tenets of historical linguistics. First, it is not entirely mysterious why rates of change can vary; there are often good explanations, both language-internal and language-external. Sub-systems within a language may long be stable, but then once perturbed (sometimes as a knock-on effect of changes on other levels of the language) can change rapidly until they settle into a new stable system, as in the case-system collapse in Bulgarian, or phonological system change in Old Irish. External contexts, too, can vary from relative isolation to punctuations of intense contact. This is widely seen as the explanation for Bergsland & Vogt’s (1962) paradigm case of the contrast between high cognate retention in Icelandic alongside bursts of lexical turnover in English, triggered by the Viking and Norman conquests. In short, where rates depart significantly from ‘default’ expectations, there is often a good reason.

Second, it is clear that language change is unstoppable. So however much the rates may vary, changes do at least build up progressively through time. Indeed in practice many of the same historical linguists who repeat the de rigueur disclaimers that glottochronology is not to be trusted do themselves go on to invoke its ‘dates’, for want of anything better.
See Kaufman & Golla (2000: 52), for instance, on the major language families of the Americas. So although glottochronology as a method is firmly discredited, on both of the above grounds there remains a widespread, grudging acceptance that rates of change over time are at least not a complete free-for-all. Or to put it more formally: notwithstanding some much-cited instances of exceptional change or stability, those plausibly represent the extremes of a distribution, even if a broad one, of more usual and more ‘modal’ rates of change.

That already brings us closer to a potential way of making use of these basic tenets of historical linguistics for a much more sophisticated, objective and realistic approach to language dating. Certainly, with all the discipline’s knowledge of language history and how it proceeds, we should be able to do far better than just simplistically totting up numbers up to 100 or 200, as glottochronology did. Indeed, instead we can enlist another fundamental tenet, namely the process of linguistic ‘descent with modification’ by which any family of languages diverges out of its common ancestor. That points to the applicability, in principle, of phylogenetics (and note, clarifies that it is not glottochronology at all).

3.8 Why Lexical Cognacy for Dating?

Given all those possibilities from Bayesian phylogenetics, it is worth reconsidering which type of language data might be most suitable as input, specifically for chronological purposes. We have already seen some of the problems with other levels of language, and in fact cognate status in the lexicon turns out to have one distinct advantage, specifically for chronological purposes, even over other, form-based data types that can seem more typically what historical linguists would use. For chronology, what is needed is a set of comparative data points that are free to change iteratively and repeatedly, so that their ‘clock’ does not at some point stop ticking and no longer allow change.
This can be a problem with most form-based linguistic data: sounds or morphemes that disappear completely generally cannot then reappear, to start changing again. Take the example of ablaut grades in Indo-European. Many Indo-European roots are found shared across multiple branches of Indo-European, but from one branch to the next the lexemes used can be based on different ablaut grades. In the meaning FOOT, for example, Germanic reflects the lengthened o-grade of PIE *ped- (hence foot), Greek πούς, ποδός comes from the (short) o-grade, and Latin pes, pedis from the e-grade. Certainly, while a cognate set is essentially defined by a shared root morpheme, it can be helpfully subdivided into its various ‘cognate sub-sets’, according to finer morphological or phonological criteria such as different ablaut grades of that root, or the presence of one or other affix additional to the root. Ultimately, though, the ablaut alternation system of Indo-European largely broke down, and lexemes in different branches fossilized their forms, which were no longer free to change any further in this criterion. By the time of Latin, pes, pedis was already long fixed as e-grade, and thereafter all Romance languages inherited that fixed datum. Italian piede, or English foot, cannot change (back) to any other Indo-European ablaut grade, which ceased being a meaningful system many millennia ago. From a dating perspective, then, ablaut grade is, unhelpfully, a datum for which the change clock stopped ticking definitively, long ago deep in the history of the Indo-European family. The clock that does remain ticking, meanwhile, is (root) cognate status. Italian and English, like any other Indo-European language, could in principle ultimately switch to a new lexeme for FOOT, in a different (root) cognate set.
This ability to iteratively change, for the clock never to stop ticking, is a crucial characteristic that makes cognate status in lexical meanings the data type of choice specifically for the purpose of dating. It is also one of the main reasons why it is most advantageous to analyze cognate status at the level of the root, rather than to use cognate subsets to try to contribute to the chronological estimation.

Note that there are in fact two tasks, then, that it is possible to take somewhat separately: recovering the phylogeny, and dating that phylogeny. The preference for using the cognacy turnover data alone for dating, then, does not limit the language data used to contribute to determining the phylogeny. To that end, one certainly can include finer analysis such as breaking down root cognate sets into their respective subsets, on further morphological and phonological criteria. Indeed, one can make use of morphological and phonological characters that are not iterative, but one-off or sequential changes. These can even be used to constrain the model to return only trees compatible with those characters, but then on that constrained phylogeny one can use only the lexical cognacy data to contribute to the dating. (See Atkinson et al. (2005: 209) for an analysis constrained to Ringe et al.’s (2002) data, including notably their key phonological and morphological criteria.)

4. How do Bayesian phylogenetic methods actually date language divergences?

Dating is important – it enables us to link linguistic divergences to archaeological and historical events, and build a coherent story about human prehistory. The search for a rigorous methodology to make accurate inferences about timing first led to a set of approaches called lexicostatistics and glottochronology, and then to an almost puritanical rejection of these approaches.
More recently, new methods from evolutionary biology – which do not share the fatal shortcomings of glottochronology – have been applied to linguistic questions. Let us start this section with a brief account of this history.

Morris Swadesh (Swadesh 1950, 1952, 1955) proposed a simple computational approach to building language trees called lexicostatistics. He argued that the number of cognate words shared between two languages in basic vocabulary was a good indicator of degree of relationship. The logic is simple: successively grouping languages by the number of cognates they shared would give rise to a family tree with more similar languages grouped together. Swadesh further realised that the degree of similarity shared between two languages was also an approximate measure of how much time had elapsed since they had separated. Making a direct analogy to radioactive decay, Swadesh argued that languages lost similarity at a relatively constant rate over time. These rates were naturally only applicable to the standardized sample of vocabulary that they were calculated from, so Swadesh proposed a series of standardized 100- and 200-item wordlists. For example, if two languages shared a proportion C of cognates and there was a constant “clock” rate r (the proportion retained per 1000 years), then the age of divergence t (in millennia) could be estimated by:

$$t = \frac{\log C}{2 \log r}$$

Swadesh (1950) initially suggested that the clock rate r was 85%/1000 years based on the differences between old and modern English. This figure was later revised to 81%/1000 years based on a survey of rates in the basic vocabulary of historically attested languages (Lees 1953). For example, since both languages change independently after they split, two languages sharing 81% cognates on a 200-item word list would have an estimated age of about 500 years, while two languages sharing 66% might be expected to have diverged about 1000 years ago. This approach became known as glottochronology. Glottochronology was first applied by Swadesh to the Salishan languages (Swadesh 1950).
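As a worked illustration of the formula above, the following Python sketch evaluates t = log C / (2 log r) for a given proportion of shared cognates. The function name `glottochronology_age` is hypothetical, and the sketch assumes that C is the proportion of cognates shared by the two languages and r the per-millennium retention rate of a single lineage, so that t comes out in millennia. With the factor of 2 in place, sharing 81% at r = 0.81 corresponds to about 500 years and sharing about 66% to about 1000 years; dropping the 2 (one language compared with its own ancestor) doubles both figures.

```python
import math


def glottochronology_age(shared, retention=0.81):
    """Strict-clock divergence time in millennia: t = log C / (2 log r).

    shared    -- proportion of cognates shared by the two languages (0 < C <= 1)
    retention -- assumed proportion retained per millennium by one lineage (0 < r < 1)
    """
    if not 0 < shared <= 1:
        raise ValueError("shared must be a proportion in (0, 1]")
    if not 0 < retention < 1:
        raise ValueError("retention must be a proportion in (0, 1)")
    # The factor 2 reflects that both lineages lose cognates independently.
    return math.log(shared) / (2 * math.log(retention))


if __name__ == "__main__":
    for shared in (0.81, 0.66, 0.43):
        years = 1000 * glottochronology_age(shared)
        print(f"{shared:.0%} shared cognates: about {years:.0f} years")
```

The calculation returns a single number and has no way of expressing uncertainty or variation in rates between languages, which is exactly the weakness that critics identified.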
However, the promise of using linguistics to date prehistoric population expansions was so tempting that glottochronology was rapidly adopted by linguists and used on languages around the world (see Hymes (1960) for a review). The findings were widely read outside linguistics with eminent anthropologists carefully discussing the methodology (Kroeber 1955) and arguing that it revolutionized prehistory (Murdock 1964). However, critics were quick to note some serious methodological flaws – with the key flaw being that language change is not clock-like. Languages can vary substantially in their rates of lexical replacement. For example, a damaging critical study showed that glottochronology would estimate that Old Norse and Icelandic diverged less than 200 years ago. This age is far younger than the historically attested age of 1000 years (Bergsland and Vogt 1962). In fact, rather than the stately 81% loss per 1000 years proposed by glottochronology, these rates varied around 15-20% (Bergsland and Vogt 1962). The finding of substantial rate variation in languages was so damaging to glottochronology that historical linguistics largely rejected quantitative methods. Today, lexicostatistics and glottochronology are seen as textbook examples of bad linguistic methodology, and we are told that “linguists do not do dates” (McMahon and McMahon 2006). In biology, the scenario started in a similar vein but played out rather differently. Biologists had developed a method to build trees from pairwise similarity (e.g. Sokal and Michener 1958). Divergence times could then be estimated from these trees using a “molecular clock” rate (Zuckerkandl and Pauling 1965), e.g. Kimura (1968) estimated a rate of one base pair change every 1.8 years. Just as in linguistics, biologists were quick to notice the problem of rate variation which could cause incorrect estimates of both the relationships on the family tree (Felsenstein 1978), and the estimated ages (Kirsch 1969). 
Unlike historical linguists, who largely rejected computational approaches, biologists developed methods for handling variation in rates. One method, known as nonparametric rate smoothing (Sanderson 1997), takes the observed rates of change in the data, and ‘smooths’ these to fit around calibration information. Another method – currently the best – is known as the relaxed clock (in contrast to what might be called the strict clock assumption of glottochronology) (Drummond et al. 2006). In the relaxed clock model, rates for each branch are drawn from a log-normal or exponential distribution with parameters estimated from the data. This procedure allows each branch to have its own rate and for the data to inform the analysis about the magnitude of rate variation across the languages.

To calibrate the clock, historical information is used to constrain the age of known branches. These calibrations allow the analysis to estimate the clock rates in regions where the timing is known and extrapolate these to infer rates in regions where the timing is unknown. While these calibrations can be simple point estimates (e.g. a given year), it is more common to specify them as probability distributions, as this allows uncertainty in the divergence time estimates to be explicitly accounted for (Ho and Phillips 2009). For example, Birchall, Dunn, and Greenhill (2016) used this approach to date the origin of the Chapacuran languages in South America to around 1,040 years ago. One of the calibrations they used was based on the suggestion by Meireles (1989) that Moré and Cojubim speakers were a single population that diverged when the Jesuits arrived in the region in the 1740s, with no mention made of the Cojubim until 1781.
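Two ingredients described above can be sketched in a few lines of Python: per-branch rates drawn from a log-normal distribution, and a calibration expressed as a log-normal distribution rather than a point estimate. This is only an illustration under stated assumptions, not the implementation used in any of the cited studies. The function names are hypothetical, the rate distribution is fixed to mean 1 with an arbitrary spread instead of being estimated from data, and the calibration is assumed to be a plain (un-offset) log-normal whose central 95% interval matches a chosen pair of bounds. The bounds 213 and 723 years are those reported for the Moré and Cojubim calibration of Birchall, Dunn, and Greenhill (2016).

```python
import math
import random
from statistics import NormalDist


def draw_branch_rates(n_branches, sigma, rng):
    """Draw one relative clock rate per branch from a log-normal with mean 1."""
    mu = -0.5 * sigma ** 2  # this choice of mu makes the expected rate equal 1
    return [rng.lognormvariate(mu, sigma) for _ in range(n_branches)]


def lognormal_from_interval(lower, upper, mass=0.95):
    """Return (mu, sigma) of a log-normal whose central `mass` interval is [lower, upper]."""
    if not 0 < lower < upper:
        raise ValueError("need 0 < lower < upper")
    z = NormalDist().inv_cdf(0.5 + mass / 2)  # standard normal quantile
    mu = (math.log(lower) + math.log(upper)) / 2
    sigma = (math.log(upper) - math.log(lower)) / (2 * z)
    return mu, sigma


def lognormal_cdf(x, mu, sigma):
    """Cumulative probability of a log-normal(mu, sigma) at x > 0."""
    return NormalDist(mu, sigma).cdf(math.log(x))


rng = random.Random(1)  # fixed seed so the sketch is repeatable
rates = draw_branch_rates(n_branches=6, sigma=0.5, rng=rng)  # sigma is arbitrary

mu, sigma = lognormal_from_interval(213, 723)
inside = lognormal_cdf(723, mu, sigma) - lognormal_cdf(213, mu, sigma)

print("relative branch rates:", [round(r, 2) for r in rates])
print(f"calibration median: {math.exp(mu):.0f} years; mass in interval: {inside:.3f}")
```

Because the distribution is symmetric on the log scale, its median is the geometric mean of the two bounds, so most of the probability sits toward the younger end of the interval while a long tail allows for an earlier split.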
Birchall, Dunn, and Greenhill (2016) implemented this calibration as a log-normal distribution with a 95% probability that Moré and Cojubim diverged between 213 and 723 years ago – so the lower bound was close to the 1780s date, while the upper bound stretched 500 years into the past to allow for the two communities to have diverged somewhat earlier than the Jesuits’ arrival in case Meireles (1989) was incorrect.

Given the serious failings of glottochronology, how confident can we be that these methodological advances are accurate? In a simulation test of phylogenetic methods for resolving language history (Greenhill, Currie, and Gray 2009), we evaluated the performance of the rate-smoothing approach. First, we simulated linguistic data on two known phylogenies (described in more detail in section 5.2). We then calibrated two arbitrarily chosen nodes on each tree and constrained them to be within ±10% of their true age. We found that the rate-smoothing approach was able to estimate a value close to the true root age even when borrowing rates reached up to 15% – the difference between the true age and the estimated age was 2.2% or 6.1% depending on the tree topology. As borrowing increased, the estimated age of the tree decreased. This result shows that these methods are fairly robust to most reasonable levels of borrowing and can estimate language divergence times reasonably accurately. The accuracy and performance of these methods continue to be an active research topic within phylogenetics, and performance will likely continue to improve (Drummond and Suchard 2010; Lanfear, Welch, and Bromham 2010; Duchêne, Lanfear, and Ho 2014).

5. How accurate are the results of Bayesian phylolinguistics?

How well do Bayesian phylogenetic methods work at recovering language history? As with any inferential tool, it is critical to assess performance and accuracy. There are two ways we can assess the performance and accuracy of phylogenetic methods on linguistic data.
The first way is to validate the phylolinguistic results against results from the comparative method, while the second validates the results of analyses of simulated data where the complete history is known.

5.1 Verification with the comparative method

Many of the studies using phylogenetic methods discuss how consistent their results are with those of the comparative method. Studies showing strong concordance between the comparative method and their phylogenetically estimated subgroups include Aslian (Dunn et al. 2011), Athabaskan (Sicoli and Holton 2014), Bantu (Grollemund et al. 2015; Grollemund 2012), Chapacuran (Birchall, Dunn, and Greenhill 2016), Huon Peninsula (Greenhill 2015), Indo-European (Bouckaert et al. 2012; Chang et al. 2015), Pama-Nyungan (Bowern and Atkinson 2012), Semitic (Kitchen et al. 2009), Timor-Alor-Pantar (Robinson and Holton 2012), Tupian (Galucio et al. 2015), Tupi-Guarani (Michael et al. 2015), and Uralic (Syrjänen et al. 2013).

To date perhaps the best comparison between the findings of the traditional comparative method and phylogenetic methods can be found in the Austronesian languages. There are strong hypotheses about the origin and subgrouping of these languages derived from the comparative method, and robust dates for many of the main groups derived from archaeology (Diamond and Bellwood 2003; Bellwood and Dizon 2008; Blust 2013; Pawley 2002; Green 2003; Ko et al. 2014). The Austronesian languages originated around 5500 years ago (Blust 2013) in Taiwan, where 9 of the 10 major branches are still spoken (Blust 1999; Blust 2013). The remaining branch – Proto-Malayo-Polynesian – spread south through the Philippines around 1000 years later (Bellwood and Dizon 2008; Blust 1999; Pawley 2002), through Indonesia and along the coast of New Guinea.
By around 3000-3200 years ago the Austronesians can be strongly linked to the development of the Lapita cultural complex in Near Oceania (Sheppard, Chiu, and Walter 2015; Green 2003), which brought a distinctive dentate-stamped and red-slipped pottery style, a range of domesticates, and social practices. Finally, around 2800-3000 years ago the Austronesians settled Western Polynesia, paused again for 1000 years, and then entered East Polynesia (Wilmshurst et al. 2011).

Importantly, Austronesian is a difficult test case, as it is a large family with more than 1200 languages that originated from a rapid population expansion into and across a range of new environments, encountering speakers of very different languages, and undergoing substantial shifts in population size. All of these factors have resulted in huge variation in cognate retention rates between languages (Blust 2000; Greenhill 2015). One of the largest lexicostatistical studies ever conducted (Dyen 1962) on these languages produced bizarre results – concluding that the homeland of the Austronesian language family was in Near Oceania rather than in Taiwan (Greenhill, Drummond, and Gray 2010).

We used phylogenetic methods on 400 Austronesian languages to test between the above ‘pulse-pause’ settlement scenario and an alternative proposal from genetics. This alternative scenario, the ‘Slow Boat’, suggests the Austronesians originated in Wallacea (the region around modern-day Sulawesi) between 13,000 and 17,000 years ago before spreading north into the Philippines and Taiwan, and east into Oceania (Oppenheimer and Richards 2001; Soares et al. 2008). Our results (Gray, Drummond, and Greenhill 2009) showed overwhelming support for the first ‘pulse-pause’ scenario and none for the ‘Slow Boat’ scenario.
This support for the pulse-pause scenario was both for the broad tree topology and age of the family – we estimated a mean age of 5230 years (95% highest posterior density interval of 4730-5790 years).

In subsequent papers we carefully evaluated the similarity between the phylogenetic tree and the subgroupings proposed by the comparative method (Greenhill, Drummond, and Gray 2010; Greenhill and Gray 2012). Our analysis did in fact recover most of the key subgroupings identified by the comparative method. Figure 3 visualizes this strong consistency between the linguistically attested subgroupings (based on the classification in the Ethnologue (Lewis 2009)) and the phylogenetic results. The similarities are overwhelming – and statistically significant (Greenhill, Drummond, and Gray 2010; Greenhill and Gray 2012). We recovered the following groups (working down the tree): Malayo-Polynesian (Dahl 1973), the Philippine ‘microgroups’ e.g. Bashiic, Central Luzon, Central Philippines, Cordilleran, Manobo, Palawanic, and Subanun (Blust 1991), Malayic (Adelaar 1992), Chamic (Thurgood 1999), Celebic (Mead 2003), Greater South Sulawesi (Adelaar 1994), North Sarawak (Blust 1974), Eastern Malayo-Polynesian (Blust 1978), Central-Eastern Malayo-Polynesian (Blust 1978), Bima-Sumba (Blust 2008), Central Maluku (Collins 1982), Yamdena-North Bomberai (Blust 1993), and South-Halmahera/West New Guinea (Blust 1978). We recover the large Oceanic subfamily (Dempwolff 1927; Ross 1998), and its Near Oceanic component groups including Temotu (Ross and Næss 2007), Southeast Solomonic (Pawley 1972), Admiralties (Ross 1988), Papuan Tip (Ross 1988), and Meso-Melanesian (Ross 1988); the majority of the North New Guinea languages are also strongly grouped together (Ross 1988), although some languages from Willaumez are misplaced due to unidentified loan words (Greenhill, Drummond, and Gray 2010).
Remote Oceania is also well resolved, with the phylogenies recovering Central Pacific (Grace 1959), Polynesian (Pawley 1966; Marck 2000), Micronesian (Bender et al. 2003), South Vanuatu (Lynch 2001), North and Central Vanuatu (Tryon 1976; Clark 2009), and New Caledonia and the Loyalties (Ozanne-Rivierre 1992).

To be fair, there are some mismatches with the expected groupings. In total 25 of the 400 languages are misplaced (Greenhill, Drummond, and Gray 2010). Some of these misplacements are due to errors in the data – e.g. the Willaumez languages (Nakanai, Maututu, Lakalai) sit closer to the North New Guinea languages rather than their expected sisters in the Meso-Melanesian subgroup. This misplacement is probably due to unidentified lexical borrowings between the Willaumez languages and their Meso-Melanesian neighbours in west New Britain. However, many of the other misplacements are due to long-standing classification difficulties. For example, it is unclear whether Irarutu should be in the Central Malayo-Polynesian or the South-Halmahera/West New Guinea subgroup (Blust 1993). Another group we do not recover is Central Malayo-Polynesian; however, this group is now thought to be a dialect chain with low internal cohesion that is only supported by overlapping isoglosses (Blust 1993). So while there are some errors in the data – discussed in full in Greenhill, Drummond, and Gray (2010) – the Bayesian phylogenetic trees do remarkably well at recovering both the known subgroups and the places of uncertainty.

5.2 Validation with simulations

The second way we might validate phylogenetic inference is to simulate data on a known ‘true’ phylogeny and then analyze the simulated data to see if we recover the ‘true’ tree. This approach is commonly used in the biological phylogenetics literature to test the effectiveness of these methods.
Greenhill, Currie, and Gray (2009) applied this logic to test how well Bayesian phylogenetic methods could recover the true tree with lexical cognate data. First, we created two tree topologies: one shaped like Polynesian with an unbalanced, chained topology, and the other shaped like Uto-Aztecan with a well-balanced tree. Then we used a simple ‘stochastic Dollo’ model of linguistic evolution (Nicholls and Gray 2006; Atkinson et al. 2005) to simulate cognate data. This model started at the root of the tree and worked its way down to the tips, evolving new cognate sets to create a new simulated dataset.

One of the major objections to phylogenetic methods in linguistics is that the borrowing of items between languages invalidates the tree (Terrell 1988; Moore 1994). Rather than throwing methods away at the first sign of trouble, it is better to assess performance and identify when the method breaks down. To this end we also simulated borrowing events in these data. During the simulation process, traits were randomly selected and copied to another lineage, simulating the borrowing of a cognate set between languages. We varied the amount of borrowing from 0% to 50% of the total cognates.

Finally, for each of the simulated datasets, we used Bayesian phylogenetic methods to estimate the posterior probability distribution. For each analysis we then constructed the maximum clade credibility tree (a single summary tree of the posterior) and then compared how close this estimate was to the ‘true’ tree. Our results show that when there is no borrowing the analysis finds a tree almost statistically indistinguishable from the true tree. As the amount of borrowing in the data increased, so did the distance between the true tree and the recovered tree. What level of borrowing is plausible?
A reasonably extreme example of language borrowing is English (leaving aside creoles and mixed languages), which has borrowed more than 60% of its total lexicon from French and Latin (Embleton 1986). However, only 16% of English’s basic vocabulary is borrowed (Embleton 1986). Given that most language phylogenies are built from basic vocabulary data, we took 0-20% as the extreme range of undetected borrowing in the type of datasets used by language phylogenies. Over this range the distance between the true trees and the recovered trees was very small – the balanced topology showed an average perturbation of at most 0.7%, while the more fragile chained topology showed an average perturbation of 6.8% at most. In summary, this simulation study shows that Bayesian phylogenetic methods are exceptionally good at finding a tree very close to the true history even in the face of realistic levels of undetected borrowing.

Conclusion

The view we have taken here is that Bayesian methods are a powerful supplement to traditional linguistic scholarship – not a replacement. In combination with the comparative method, they enable us to estimate uncertainty in subgrouping proposals, quantify branch lengths and infer divergence dates. In the future we expect to see more combined analyses of lexical, phonological and morphological data and the development of increasingly sophisticated “computer assisted” cognate detection and models of cognate evolution. If linguists can exorcise the ghost of lexicostatistics past, then there is an exciting future to embrace.

References

Adelaar, K.A. 1992. Proto-Malayic: A Reconstruction of Its Phonology and Part of Its Morphology and Lexicon. Canberra: Pacific Linguistics. Adelaar, K.A. 1994. “The Classification of the Tamanic Languages.” In Language Contact and Change in the Austronesian World, edited by T. Dutton and D. Tryon, 1–42. Berlin: Mouton de Gruyter. Atkinson, Q. and Gray, R.D. 2005.
Curious parallels and curious connections: Phylogenetic thinking in biology and historical linguistics. Systematic Biology, 54(4), 513-526. Atkinson, Q., Nicholls, G., Welch, D., & Gray, R. 2005. From words to dates: water into wine, mathemagic or phylogenetic inference? Transactions of the Philological Society 103(2): p.193–219. http://dx.doi.org/10.1111/j.1467-968X.2005.00151.x Bellwood, Peter, and E. Dizon. 2008. “Austronesian Cultural Origins: Out of Taiwan, via the Batanes Islands, and Onwards to Western Polynesia.” In Past Human Migrations in East Asia: Matching Archaeology, Linguistics, and Genetics, edited by A Sanchez-Mazas, R. Blench, M. D. Ross, I. Peiros, and M. Lin, 23–39. London: Routledge. Bender, B. W., Ward H. Goodenough, F. H. Jackson, J. C. Marck, Kenneth L. Rehg, H. Sohn, S. Trussel, and J. W. Wang. 2003. “Proto-Micronesian Reconstruction.” Oceanic Linguistics 1 (42): 272–358. Bergsland, K., & Vogt, H. 1962. On the validity of glottochronology. Current Anthropology 3(2): p.115–153. www.jstor.org/stable/2739527 Bickel, Balthasar, Johanna Nichols, Taras Zakharko, Alena Witzlack-Makarevich, Kristine Hildebrandt, Michael Rießler, Lennart Bierkandt, Fernando Zúñiga & John B. Lowe. (2017). The AUTOTYP typological databases. Version 0.1.0 https://github.com/autotyp/autotyp-data/tree/0.1.0 Birchall, Joshua, Michael Dunn, and Simon J. Greenhill. 2016. “A Combined Comparative and Phylogenetic Analysis of the Chapacuran Language Family.” International Journal of American Linguistics 82 (3): 255–84. Blust, R. 1974. The Proto-North Sarawak Vowel Deletion Hypothesis. University of Hawaii. ———. 1978. “Eastern Malayo-Polynesian: A Subgrouping Argument.” In Second International Conference on Austronesian Linguistics: Proceedings, Fascicle I, Western Austronesian, edited by S.A. Wurm and L. Carrington, 181–234. Canberra: Pacific: Linguistics. ———. 1991. “The Greater Central Philippines Hypothesis.” Oceanic Linguistics 30 (73): 129. Blust, Robert. 1993. 
“Central and Central-Eastern Malayo-Polynesian.” Oceanic Linguistics 32: 241–93. Blust, Robert. 1999. “Subgrouping, Circularity and Extinction: Some Issues in Austronesian Comparative Linguistics.” In Selected Papers from the Eighth International Conference on Austronesian Linguistics, edited by Zeitoun E. and P J-K. Li, 31–94. Taipei, Taiwan: Symposium Series of the Institute of Linguistics, Academia Sinica. Blust, Robert. 2000. “Why Lexicostatistics Doesn’t Work: The ‘Universal Constant’ Hypothesis and the Austronesian Languages.” In Time Depth in Historical Linguistics, edited by C Renfrew, A McMahon, and L Trask, 311–31. Cambridge: McDonald Institute for Archaeological Research. Blust, Robert. 2008. “Is There a Bima-Sumba Subgroup?” Oceanic Linguistics 47 (1): 45–113. doi:10.1353/ol.0.0006. Blust, Robert. 2013. The Austronesian Languages. Revised Ed. Canberra: Asia-Pacific Linguistics. Bouckaert, R.R. 2010. DensiTree: making sense of sets of phylogenetic trees. Bioinformatics 26(10): 1372-1373. doi:10.1093/bioinformatics/btq110. Bouckaert, Remco R., P. Lemey, Michael Dunn, Simon J. Greenhill, Alexander V. Alekseyenko, A. J. Drummond, Russell D. Gray, M. A. Suchard, and Quentin D. Atkinson. 2012. “Mapping the Origins and Expansion of the Indo-European Language Family.” Science 337 (6097): 957–60. doi:10.1126/science.1219669. Bowern, Claire, and Quentin D. Atkinson. 2012. “Computational Phylogenetics and the Internal Structure of Pama-Nyungan.” Language 88 (4): 817–45. Bowern C. 2016. Chirila: Contemporary and Historical Resources for the Indigenous Languages of Australia. Lang Doc Conserv 10:1–44. Burnham, Kenneth P. & David R. Anderson. 1998. Model Selection and Inference: A practical information-theoretic approach. New York: Springer. Chang, Will, Chundra Cathcart, David Hall, and Andrew Garrett. 2015. “Ancestry-Constrained Phylogenetic Analysis Supports the Indo-European Steppe Hypothesis.” Language 91 (1): 194–244. doi:10.1353/lan.2015.0005. Collins, J.T. 1982.
“Linguistic Research in Maluku: A Report of Recent Fieldwork.” Oceanic Linguistics 21: 73–146. Dahl, O.C. 1973. Proto-Austronesian. Vol. 15. Sweden: Scandinavian Institute of Asian Studies Monograph Series. Dempwolff, O. 1927. “Das Austronesische Sprachgut in Den Melanesischen Sprachen.” Folia Ethnoglossica 3: 32–43. Diamond, Jared M, and Peter Bellwood. 2003. “Farmers and Their Languages: The First Expansions.” Science 300 (5619): 597–603. doi:10.1126/science.1078208. Dixon, R.M.W. 1997. The Rise and Fall of Languages. Cambridge: Cambridge University Press. Dolgopolsky, A. 1990. More about the Indo-European homeland problem. Mediterranean Language Review 6–7: p.230–248. Drummond, Alexei J., and Marc A. Suchard. 2010. “Bayesian Random Local Clocks, or One Rate to Rule Them All.” BMC Biology 8 (114): 114. doi:10.1186/1741-7007-8-114. Drummond, Alexei J., Simon Y. W. Ho, Matthew J. Phillips, and Andrew Rambaut. 2006. “Relaxed Phylogenetics and Dating with Confidence.” PLOS Biology 4 (5): e88. Duchêne, Sebastián, Robert Lanfear, and Simon Y W Ho. 2014. “The Impact of Calibration and Clock-Model Choice on Molecular Estimates of Divergence Times.” Molecular Phylogenetics and Evolution 78 (June): 277–89. doi:10.1016/j.ympev.2014.05.032. Dunn, Michael, Niclas Burenhult, Nicole Kruspe, Sylvia Tufvesson, and Neele Becker. 2011. “Aslian Linguistic Prehistory: A Case Study in Computational Phylogenetics.” Diachronica 28 (3): 291–323. doi:10.1075/dia.28.3.01dun. Dyen, Isidore. 1962. “The Lexicostatistical Classification of the Malayopolynesian Languages.” Language 38: 38–46. Dyen, I., Kruskal, J.B., & Black, P. 1992. An Indoeuropean Classification: A Lexicostatistical Experiment. Philadelphia: American Philosophical Society. www.wordgumbo.com/ie/cmp/iedata.txt Embleton, Sheila. 1986. Statistics in Historical Linguistics. Bochum: Studienverlag Brockmeyer. Felsenstein, J. 1978. The number of evolutionary trees.
Systematic Zoology 27: 27-33 Galucio, Ana Vilacy, Sérgio Meira, Joshua Birchall, Denny Moore, Nilson Gabas Júnior, Sebastian Drude, Luciana Storto, Gessiane Picanço, and Carmen Reis Rodrigues. 2015. “Genealogical Relations and Lexical Distances Within the Tupian Linguistic Family.” Boletim Do Museu Paraense Emilio Goeldi: Ciencias Humanas 10 (2): 229–74. doi:10.1590/198181222015000200004. Grace, G.W. 1959. The Position of the Polynesian Languages Within the Austronesian (MalayoPolynesian) Language Family. Memoir 16 of the International Journal of American Linguistics. Indiana: Indiana University publications in anthropology; linguistics. Gray, Russell D., Alexei J. Drummond, and Simon J. Greenhill. 2009. “Language Phylogenies Reveal Expansion Pulses and Pauses in Pacific Settlement.” Science 323 (5913): 479–83. doi:10.1126/science.1166858. Gray, R.D., Greenhill, S.J., and Bryant, D. (2010). On the shape and fabric of human history. Philosophical Transactions of the Royal Society London, B, 365, 3923-3933. Gray, R.D. & Watts, J. (2017). Cultural macroevolution matters. Proceedings of the National Academy of Sciences, 114 (30) 7846-7852. Green, Roger Curtis. 2003. “The Lapita Horizon and Traditions-Signature for One Set of Oceanic Migrations.” In Pacific Archaeology: Assessments and Anniversary of the First Lapita Excavation (July 1952), edited by C. Sand, 95–120. New Caledonia. Greenhill, Simon J. 2015. “TransNewGuinea.org: An Online Database of New Guinea Languages.” PLoS ONE 10 (10): 1–17. doi:10.1371/journal.pone.0141563. Greenhill, Simon J., and Russell D. Gray. 2012. “Basic Vocabulary and Bayesian Phylolinguistics: Issues of Understanding and Representation.” Diachronica 29 (4): 523–37. doi:10.1075/dia.29.4.05gre. Greenhill, Simon J., Thomas E. Currie, and Russell D. Gray. 2009. “Does Horizontal Transmission Invalidate Cultural Phylogenies?” Proceedings. Biological Sciences / the Royal Society 276 (1665): 2299–2306. doi:10.1098/rspb.2008.1944. 
Greenhill, Simon J., Alexei J Drummond, and Russell D. Gray. 2010. “How Accurate and Robust Are the Phylogenetic Estimates of Austronesian Language Relationships?” PLoS ONE 5 (3): e9573. doi:10.1371/journal.pone.0009573. Greenhill SJ, Blust R, Gray RD (2008) The Austronesian Basic Vocabulary Database: From bioinformatics to lexomics. Evol Bioinform Online 4:271–283. Grollemund, Rebecca. 2012. “Nouvelles Approches En Classification: Application Aux Langues Bantu Du Nord-Ouest.” PhD thesis. Grollemund, Rebecca, Simon Branford, Koen Bostoen, Andrew Meade, Chris Venditti, and Mark Pagel. 2015. “Bantu Expansion Shows Habitat Alters the Route and Pace of Human Dispersals.” Proceedings of the National Academy of Sciences of the USA. doi:10.1073/pnas.1503793112. Haspelmath, M. 2005. The World Atlas of Language Structures, Oxford Univ Press, Oxford. 19 Heggarty, P. 2010. Beyond lexicostatistics: How to get more out of ‘word list’ comparisons. Diachronica 27(2): p.301–324. http://doi.org/10.1075/dia.27.2.07heg Heggarty, P. 2014. Prehistory through language and archaeology. In C. Bowern & B. Evans (eds) Routledge Handbook of Historical Linguistics, 598–626. London: Routledge. https://www.academia.edu/3687718 Heggarty, P. forthcoming. Why Indo-European? Clarifying cross-disciplinary misconceptions on farming vs. pastoralism, Journal of Indo-European Studies — special issue on Indo-European and Farming, edited by G. Kroonen & B. Comrie. Heggarty, P., & Renfrew, C. 2014. South and Island South-East Asia: Languages. In C. Renfrew & P. Bahn (eds) The Cambridge World Prehistory, 534–558. Cambridge: Cambridge University Press. Ho, Simon Y. W., and Matthew J. Phillips. 2009. “Accounting for Calibration Uncertainty in Phylogenetic Estimation of Evolutionary Divergence Times.” Systematic Biology 58 (3): 367–80. doi:10.1093/sysbio/syp035. Hymes, Dell. 1960. “Lexicostatistics so Far.” Current Anthropology 1 (1): 3–44. Kaufman, T., & Golla, V. 2000. 
Language groupings in the New World: their reliability and usability in cross-disciplinary studies. In C. Renfrew (ed) America Past, America Present: Genes and Languages in the Americas and Beyond, 47–57. Cambridge: McDonald Institute for Archaeological Research. Kimura, M. 1968. “Evolutionary Rate at the Molecular Level.” Nature 217: 624–26. Kirsch, John A W. 1969. “Serological Data and Phylogenetic Inference: The Problem of Rates of Change.” Systematic Zoology 18 (3): 296–311. Kitchen, Andrew, Christopher Ehret, Shiferaw Assefa, and Connie J Mulligan. 2009. “Bayesian Phylogenetic Analysis of Semitic Languages Identifies an Early Bronze Age Origin of Semitic in the Near East.” Proceedings of the Royal Society B: Biological Sciences 270 (1668): 2703– 10. doi:10.1098/rspb.2009.0408. Ko, Albert Min Shan, Chung Yu Chen, Qiaomei Fu, Frederick Delfin, Mingkun Li, Hung Lin Chiu, Mark Stoneking, and Ying Chin Ko. 2014. “Early Austronesians: Into and Out of Taiwan.” American Journal of Human Genetics 94 (3). The American Society of Human Genetics: 426–36. doi:10.1016/j.ajhg.2014.02.003. Kroeber, A L. 1955. “Linguistic Time Depth Results so Far and Their Meaning.” International Journal of American Linguistics 21 (2): 91–104. Lanfear, Robert, John J Welch, and Lindell Bromham. 2010. “Watching the Clock: Studying Variation in Rates of Molecular Evolution Between Species.” Trends in Ecology & Evolution 25 (9). Elsevier Ltd: 495–503. doi:10.1016/j.tree.2010.06.007. Lees, R B. 1953. “The Basis of Glottochronology.” Language 29 (2): 113–27. Lewis, Paul M., ed. 2009. Ethnologue: Languages of the World. 16th ed. Dallas, Texas: SIL International. List, Johann-Mattis, Jananan Sylvestre Pathmanathan, Philippe Lopez & Eric Bapteste. 2016. Unity and disunity in evolutionary sciences: process-based analogies open common research avenues for biology and linguistics. Biology Direct 11(39). 1–17. Lynch, J. 2001. The Linguistic History of Southern Vanuatu. Canberra: Pacific Linguistics. 
Mallory, J.P. 1989. In Search of the Indo-Europeans. London: Thames & Hudson. 20 McMahon, A.M.S., & McMahon, R. 2006. Why linguists don’t do dates. In P. Forster & C. Renfrew (eds) Phylogenetic Methods and the Prehistory of Languages, 153–160. Cambridge: McDonald Institute for Archaeological Research. Mead, D. 2003. “Evidence for a Celebic Supergroup.” In Issues in Austronesian Historical Phonology, edited by John Lynch, 115–41. Canberra: Pacific Linguistics. Meireles, Denise Maldi. 1989. Guardiães da Fronteira: Rio Guaporé, século XVIII. Petrópolis: Vozes. Michael, Lev, Natalia Chousou-polydouri, Keith Bartolomei, Erin Donnelly, Vivian Wauters, and Zachary O Hagan. 2015. “A Bayesian Phylogenetic Classification of Tupi-Guarani.” LIAMES 15 (2): 1–36. Moore, J. H. 1994. “Putting Anthropology Back Together Again: The Ethnogenetic Critique of Cladistic Theory.” American Anthropologist 96 (4): 925–48. Moran S., McCloy D., Wright R. (2014) PHOIBLE Online (Max Planck Institute for Evolutionary Anthropology, Leipzig). Murdock, George Peter. 1964. “Genetic Classification of the Austronesian Languages: A Key to Oceanic Culture History.” Ethnology 3 (2): 117–26. doi:10.2307/3772706. Nakhleh, L., Ringe, D., & Warnow, T. 2005. Perfect phylogenetic networks: a new methodology for reconstructing the evolutionary history of natural languages. Language 81(2): p.382–420. www.jstor.org/stable/4489897 Nascimento, F.F., Reis, M.d., and Yang, Z. (2017). A biologist’s guide to Bayesian phylogenetic analysis. Nature Ecology & Evolution, 1, 1446–1454 Nicholls, Geoff K., and Russell D. Gray. 2006. “Dated Ancestral Trees from Binary Trait Data.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (3): 545–66. doi:10.1111/j.1467-9868.2007.00648.x. Oppenheimer, Stephen J., and Martin B Richards. 2001. “Fast Trains, Slow Boats, and the Ancestry of the Polynesian Islanders.” Science Progress 84 (3): 157–81. Ozanne-Rivierre, F. 1992. 
“The Proto-Oceanic Consonantal System and the Languages of New Caledonia.” Oceanic Linguistics 31: 191–207. Pawley, A. 1972. “On the Internal Relationships of Eastern Oceanic Languages.” In Studies in Oceanic Culture History, edited by R.C. Green and M. Kelly, 13:3–106. Honolulu: Bernice P. Bishop Museum. Pawley, Andrew. 1966. “Polynesian Languages: A Subgrouping Based on Shared Innovations in Morphology.” Journal of the Polynesian Society 75 (1): 39–64. ———. 2002. “The Austronesian Dispersal: Languages, Technologies and People.” In Examining the Farming/Language Dispersal Hypothesis, edited by P Bellwood and Colin Renfrew, 251– 73. Cambridge: McDonald Institute for Archaeological Research. Penny, David, B J McComish, M A Charleston, and Michael D Hendy. 2001. “Mathematical Elegance with Biochemical Realism: The Covarion Model of Molecular Evolution.” Journal of Molecular Evolution 53 (6): 711–23. Ringe, D.A., Warnow, T., & Taylor, A. 2002. Indo-European and computational cladistics. Transactions of the Philological Society 100(1): p.59–129. http://dx.doi.org/10.1111/1467968X.00091 21 Robinson, Laura C, and Gary Holton. 2012. “Internal Classification of the Alor-Pantar Language Family Using Computational Methods Applied to the Lexicon.” Language Dynamics and Change 2: 1–27. doi:10.1163/22105832-20120201. Ross, M. 1988. Proto Oceanic and the Austronesian Languages of Western Melanesia. Canberra: Australian National University; Pacific Linguistics. Ross, Malcolm, and Åshild Næss. 2007. “An Oceanic Origin for Äiwoo, a Language of the Reef Islands.” Oceanic Linguistics 46: 456–98. Sanderson, Michael J. 1997. “A Nonparametric Approach to Estimating Divergence Times in the Absence of Rate Constancy.” Molecular Biology and Evolution 14 (12): 1218–31. Sheppard, Peter J, Scarlett Chiu, and Richard Walter. 2015. “Re-Dating Lapita Movement into Remote Oceania.” Journal of Pacific Archaeology 6 (1): 26–36. Sicoli, Mark A., and Gary Holton. 2014. 
“Linguistic Phylogenies Support Back-Migration from Beringia to Asia.” PloS One 9 (3): e91722. doi:10.1371/journal.pone.0091722. Sims-Williams, P. 1998. Genetics, linguistics, and prehistory: thinking big and thinking straight. Antiquity 72(277): p.505–527. Soares, P, J A Trejaut, J H Loo, Catherine Hill, M Mormina, C L Lee, Y.M. Chen, et al. 2008. “Climate Change and Postglacial Human Dispersals in Southeast Asia.” Molecular Biology and Evolution 25 (6): 1209–18. Sokal, Robert R, and Charles D. Michener. 1958. “A Statistical Method for Evaluating Systematic Relationships.” The University of Kansas Science Bulletin 38 (22): 1409–38. Swadesh, M. 1952. “Lexico-Statistic Dating of Prehistoric Ethnic Contacts.” Proceedings of the American Philosophical Society 96 (4): 452–63. ———. 1955. “Towards Greater Accuracy in Lexicostatistic Dating.” International Journal of American Linguistics 21 (2): 121–37. Swadesh, Morris. 1950. “Salish Internal Relationships.” International Journal of American Linguistics 16 (4): 157–67. doi:10.1086/464084. Syrjänen, Kaj, Terhi Honkola, Kalle Korhonen, Jyri Lehtinen, Outi Vesakoski, and Niklas Wahlberg. 2013. “Shedding More Light on Language Classification Using Basic Vocabularies and Phylogenetic Methods: A Case Study of Uralic.” Diachronica 30 (3): 323–52. doi:10.1075/dia.30.3.02syr. Tadmor, U., Haspelmath, M., & Taylor, B. 2010. Borrowability and the notion of basic vocabulary. Diachronica 27(2): p.226–246. http://dx.doi.org/10.1075/dia.27.2.04tad Terrell, J. 1988. “History as a Family Tree, History as an Entangled Bank: Constructing Images and Interpretations of Prehistory in the South Pacific.” Antiquity 62: 642–57. Thurgood, G. 1999. From Ancient Cham to Modern Dialects. 28. Hawaii: Oceanic Linguistics Special Publications. Tryon, D. T. 1976. New Hebrides Languages: An Internal Classification. Canberra: Pacific Linguistics. Welch, John J, and Lindell Bromham. 2005. 
“Molecular Dating When Rates Vary.” Trends in Ecology & Evolution 20 (6): 320–7. doi:10.1016/j.tree.2005.02.007. Wilmshurst, Janet M, Terry L Hunt, Carl P Lipo, and Atholl J Anderson. 2011. “High-Precision Radiocarbon Dating Shows Recent and Rapid Initial Human Colonization of East Polynesia.” 22 Proceedings of the National Academy of Sciences of the United States of America 108 (5): 1815–20. doi:10.1073/pnas.1015876108. Yang, Z. 1993. “Maximum-Likelihood Estimation of Phylogeny from Dna Sequences When Substitution Rates Differ over Sites.” Molecular Biology and Evolution 10 (6): 1396–1401. Zuckerkandl, E., and L. Pauling. 1965. “History of Evolutionary Molecules as Documents.” Journal of Theoretical Biology 8 23 FIGURES Figure 1. The steps involved in a Bayesian phylogenetic analysis of lexical data. First the lexical data is cognate coded, then the cognate sets are expressed as either a multistate or binary matrix, then the prior information for the Bayesian analysis is specified, then the stochastic model of character change is selected, then multiple MCMC searches are run, and finally the resulting posterior distribution of trees and their associated parameters are summarized. Figure 2. A densitree of Central Pacific languages constructed from the posterior distribution of trees from a Bayesian analysis of basic vocabulary. Note that the densitree reveals considerable conflicting signal that can not be captured in single tree. Figure 3. A comparison of the Austronesian phylogeny (Maximum Clade Credibility tree from a Bayesian analysis) from Gray et al. (2009) vs. the “known” subgrouping from Ethnologue. The Bayesian estimate recovers most of the major subgroups in the Ethnologue classification, and the mismatches mainly occur where the putative subgroups are not broadly accepted. Both trees differ enormously from lexicostatistical estimates of Austronesian relationships (see Greenhill and Gray 2012). 
[Figure 1 panels]
LEXICAL DATA and COGNATE CODED DATA:
Language    Word      Multistate  Binary
Fijian      taba-na   A           100
Marquesan   peheu     B           010
Hawaiian    ēheu      B           010
Tahitian    pererau   C           001
Maori       parirau   C           001
PRIOR INFORMATION: tree prior, branch length, calibrations, constraints, model parameters.
MODEL SELECTION: simple binary, covarion, stochastic Dollo; multistate/binary; strict/relaxed clock.
MCMC SEARCH and POSTERIOR DISTRIBUTION OF TREES: example trees over Hawaiian, Marquesan, Maori, Tahitian and Fijian.

[Figure 2 tip labels] FijianBau, Rotuman, Tongan, Samoan, Tikopia, EastUvea, EastFutuna, Rennellese, Kapingamarangi, Nukuoro, Luangiua, Mangareva, Marquesan, Hawaiian, Maori, SouthIslandMaori, RapanuiEasterIsland, Tuamotu, TahitianModern, Rarotongan.

[Figure 3 panels: Ethnologue; Phylogenetic] Subgroup labels: Eastern Polynesian, Ellicean, Futunic, PCP, Micronesian, North & Central Vanuatu, South Vanuatu, S.E. Solomonic, Temotu, Admiralties, Oc, EMP, WOc, Meso-Melanesian, North New Guinea, Papuan Tip, CEMP, SHWNG, Central Malayo-Polynesian, MP, Western Malayo-Polynesian, Philippines, Formosan.
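The second step in Figure 1, expressing cognate sets as a binary matrix, can be illustrated with a minimal Python sketch. It uses only the five-language example shown in the figure panels; the names `COGNATE_SETS` and `binarize` are hypothetical and chosen here for illustration, and the sketch assumes a single meaning with no missing data or multiple cognate sets per language, which real datasets would need to handle.

```python
# Multistate cognate coding for one meaning, as in the Figure 1 example:
# each language is assigned the label of the cognate set its word belongs to.
COGNATE_SETS = {
    "Fijian": "A",     # taba-na
    "Marquesan": "B",  # peheu
    "Hawaiian": "B",   # ēheu
    "Tahitian": "C",   # pererau
    "Maori": "C",      # parirau
}


def binarize(cognate_sets):
    """Recode multistate cognate labels as presence/absence strings.

    Each cognate set becomes one binary character (column); a language
    scores 1 if its word belongs to that set and 0 otherwise.
    """
    classes = sorted(set(cognate_sets.values()))  # column order: A, B, C
    return {
        language: "".join("1" if label == c else "0" for c in classes)
        for language, label in cognate_sets.items()
    }


if __name__ == "__main__":
    for language, row in binarize(COGNATE_SETS).items():
        print(f"{language:<10} {row}")
```

For the Figure 1 data this reproduces the binary column of the figure (Fijian 100, Marquesan and Hawaiian 010, Tahitian and Maori 001). A full analysis would concatenate such columns across all meanings before specifying priors and a model of character change.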