Quantitative Linguistics by Jiří Milička
In our experiment, the Saussurean postulate of arbitrariness has been empirically tested in order to see whether this postulate can be applied to all words to the same extent. Three hundred participants were asked to match Czech words with their Hindi translations. One set of words was randomly chosen from a Hindi corpus (set A); the second set consisted of both randomly chosen words and words categorized as ideophones (set B). The participants were successful in matching both sets (the lower bound of the confidence interval is about 7% above random guessing), and their performance showed unexpected patterns. For one, not only iconic properties (the sound qualities) but also iconicity itself is an important distinctive feature, and recipients are able to exploit this. Moreover, even words considered to be non-iconic (set A) apparently contain a degree of iconicity, which participants are able to draw upon. However, participants appear to lose this ability when non-iconic words are presented in the context of words with evident and abundant iconicity (set B). The effect resembles the accommodation process which is known for other senses; therefore, we call the effect “Iconicity flash blindness”.
Reinhard Köhler (1984) proposed that the linguistic constructs which have to be processed by the human parser consist of plain information (which needs to be communicated) and structure information, and that this can explain Menzerath's law. Our paper assumes that the amount of plain information and the amount of structure information are mutually independent. A new model of the nested structure of text and of Menzerath's law can be based on this assumption. A formula derived from the model is successfully tested and the results are compared to the classical Menzerath-Altmann law.
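For orientation, the classical Menzerath-Altmann law mentioned above is commonly written in the following form. This is the standard textbook formulation, not the new formula derived in the paper (which the abstract does not state). Sign conventions for the exponents vary between authors, so the signs here are absorbed into the fitted parameters.

```latex
% y: mean size of the constituents, x: size of the construct,
% a, b, c: empirically fitted parameters (signs depend on convention)
y(x) = a \, x^{b} \, e^{c x}
% A frequently used truncated form sets c = 0:
y(x) = a \, x^{b}, \qquad b < 0 \ \text{for a decreasing relation}
```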
Linguistic Frontiers, 2018
Previous studies based on English, Russian and Chinese corpora show that the average word length in texts grows steadily across centuries. These findings are in accordance with our results: the average word length in Arabic texts also grows during the analysed time span (8th century to the first half of the 20th century). Our paper shows the detailed statistics of the word length distribution century by century. The dynamics of the average word length correlates with the dynamics of the average word distribution entropy, which encourages an explanation of the phenomenon based on the Shannonian theory of communication.
Empirical Approaches to Text and Language Analysis, 2014
Examining a large corpus of Greek texts, we found that the average length of syllables in disyllabic words is lower than the average length of the syllable in monosyllabic words and lower than the average length of syllables in trisyllabic words. This peculiar phenomenon can be interpreted as a counterexample to Menzerath's law.
Methods and Applications of Quantitative Linguistics - Selected papers of the 8th International Conference on Quantitative Linguistics (QUALICO), 2013
This paper shows that the type-token relation, the hapax-token relation and, generally, the relation between types of a certain frequency and tokens can be computed from the rank-frequency relation or from any type of frequency distribution, and that the type-token relation can be computed from the hapax-token relation. There is no need for any approximation or assumptions: the formulae can be derived purely algebraically. The second part of the paper observes that, for very large corpora, the ratio between the number of hapax legomena and types converges to a constant Z > 0. Under this assumption an approximation is built that enables us to predict the type-token relation and the other aforementioned relations from the single parameter Z. This approximation is only valid for very large corpora. As the last chapter shows, this assumption implies that for an infinitely increasing number of tokens, the number of types increases beyond any limit.
Contains an exact formula for computing the type-token relation curve from the frequency distribution of types in a text (or from the rank-frequency distribution). The formula is generalized to compute not only the number of types, but also the number of types of a certain frequency.
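The idea of deriving a type-token curve from a frequency distribution can be illustrated with a standard combinatorial expectation. The sketch below assumes that a sample of n tokens is drawn at random without replacement from the text. Whether this coincides with the exact formula in the paper is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import comb


def expected_types(tokens, n):
    """Expected number of distinct types among n tokens drawn at random
    (without replacement) from `tokens`. Illustrative sketch only."""
    total_tokens = len(tokens)
    if not 0 <= n <= total_tokens:
        raise ValueError("n must lie between 0 and the text length")
    all_samples = comb(total_tokens, n)
    # A type with frequency f is missing from the sample with probability
    # C(N - f, n) / C(N, n); it is present with the complementary probability.
    return sum(
        1 - comb(total_tokens - f, n) / all_samples
        for f in Counter(tokens).values()
    )


def expected_types_with_frequency(tokens, n, k):
    """Expected number of types occurring exactly k times in such a sample
    (k = 1 gives the expected number of hapax legomena)."""
    total_tokens = len(tokens)
    if not 0 <= n <= total_tokens:
        raise ValueError("n must lie between 0 and the text length")
    if k < 0 or k > n:
        return 0.0
    all_samples = comb(total_tokens, n)
    # Hypergeometric probability that exactly k of a type's f tokens are drawn.
    return sum(
        comb(f, k) * comb(total_tokens - f, n - k) / all_samples
        for f in Counter(tokens).values()
    )
```

Calling `expected_types(tokens, n)` for n = 1, 2, ..., len(tokens) traces the whole curve; at n = len(tokens) it returns the actual number of types in the text.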
In Arabic, the mutual order of prepositional phrases syntactically dependent on one head is neither fixed nor random. This paper explores the factors affecting the order of the prepositions from and to. Many factors related to syntax, morphology and phonology are taken into account and analysed with a corpus-driven approach.
Journal of Quantitative Linguistics, 2013
This article deals with one of the oldest and most traditional fields in quantitative linguistics, the concept of vocabulary richness. Although there are several methods for vocabulary richness measurement, all of them are influenced by text size. Therefore, the authors propose a new way of measuring vocabulary richness without any text length dependence. In the second part of the article, the new method is used for a genre analysis of texts written by the Czech writer Karel Čapek. Furthermore, differences between authors and between languages are studied with this method.
Issues in Quantitative Linguistics 4, 2016
Length motifs (L-motifs) are defined as sequences of words whose lengths are monotonically increasing. In recent years, L-motifs have attracted well-deserved attention as they provide a new view of texts and their syntagmatic properties and nested structures. This study examines the key L-motifs, i.e. motifs that are overrepresented in texts, and negative key L-motifs, i.e. motifs that are underrepresented in texts. The data reveal motifs that are typical for Czech texts, motifs that are typical for Arabic texts, and motifs that are typical for both Czech and Arabic texts; their existence suggests that there are new general language-independent patterns waiting to be explored.
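A minimal sketch of how a text might be segmented into L-motifs. It assumes that word length is counted in characters (studies often count syllables instead) and that a motif is a maximal run of words in which the length never decreases; both choices are assumptions made for illustration, and the function name `l_motifs` is hypothetical.

```python
def l_motifs(words):
    """Split a sequence of words into L-motifs, returned as tuples of lengths.

    Assumption: a motif is a maximal run of non-decreasing word lengths,
    with length measured in characters.
    """
    motifs = []
    current = []
    for word in words:
        length = len(word)
        # A shorter word than the previous one closes the current motif.
        if current and length < current[-1]:
            motifs.append(tuple(current))
            current = []
        current.append(length)
    if current:
        motifs.append(tuple(current))
    return motifs
```

Counting the resulting tuples (for example with `collections.Counter`) gives the kind of motif frequency distribution that can then be compared across texts.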
Sequences in Language and Text, Apr 2015
The distribution of L-motifs (measured on a text T) is similar to the L-motif distribution measured on the pseudotext T’ constructed by random transposition of all tokens within the text T. This inspires the suggestion that the distribution of L-motifs is inherited from the word length distribution (or, in other words, that the word length distribution of a text implies the distribution of L-motifs). The paper clearly shows that, despite the similarity, an L-motif structure independent of the word length distribution can be detected.
Simonetta Montemagni, Joakim Nivre (Eds.): Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), 2017
According to the Menzerath-Altmann law, there is a relation between the size of the whole and the mean size of its parts. The validity of the law has been demonstrated on relations between several language units, e.g., the longer a word, the shorter the syllables the word consists of. In this paper it is shown that the law is also valid for syntactic dependency structure in Czech. In particular, longer clauses tend to be composed of shorter phrases (the size of a phrase is measured by the number of words it consists of).
This paper describes an application designed for the functional testing of mutual intelligibility of related varieties, from its data structure to its interface and use.
This paper presents the results of a pilot project designed to functionally test the mutual intelligibility of spoken Maltese, Tunisian Arabic and Benghazi Libyan Arabic. We compiled an audio-based intelligibility test consisting of three components: a word test where the respondents were asked to perform a semantic classification task with 11 semantic categories; a sentence test where the task was to provide a translation of a sentence into the respondent’s native language; and a text test where a short text was listened to twice and the respondents were asked to answer 8 multiple-choice questions. We collected data from 24 respondents in Malta, Tunis and Benghazi. Our analysis shows asymmetric mutual intelligibility between the two mainstream Arabic varieties and Maltese: speakers of Tunisian and Benghazi Arabic are able to understand about 40% of what is being said to them in Maltese, whereas that ratio is about 30% for speakers of Maltese exposed to either variety of Arabic. Additionally, we found that Tunisian Arabic has the highest level of mutual intelligibility with either of the other two varieties. Combining the intelligibility scores with edit distance data, we were able to sketch out the variables involved in enabling and inhibiting mutual intelligibility for all three varieties of Arabic and provide a rough analysis of the linguistic distance between them as branches of North African Arabic.
Natural Language Processing by Jiří Milička
Proceedings of CITALA 2014 (5th International Conference on Arabic Language Processing ), 2014
The contribution introduces a corpus linguistic search engine that ranks its results according to the keyness measure and the importance of the document within the corpus. For this purpose, the minimal ratio is measured for each word and the corpus is hypertextualized. Differences between genres are taken into account.
When comparing the use of two word types within one text, we can do it by comparing the contexts in which they occur. We pick all the tokens that occur, e.g., immediately to the right of the word A and immediately to the right of the word B, thus getting two multiple subsets of the text. This paper offers a method for comparing such subsets (and its use is not limited to the field of linguistics). The method is based on comparing the cardinality of the intersection of the two multiple subsets with a model which characterizes the average cardinality of all possible subsets of a given length from the given text. The model is derived algebraically.
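The first step described above, collecting the right-hand neighbours of two words and intersecting them as multisets, can be sketched as follows. This only illustrates the observed intersection cardinality; the algebraic model of the average cardinality from the paper is not reproduced here, and the function names are hypothetical.

```python
from collections import Counter


def right_context(tokens, word):
    """Multiset of tokens occurring immediately to the right of `word`."""
    return Counter(
        following
        for current, following in zip(tokens, tokens[1:])
        if current == word
    )


def shared_context_size(tokens, word_a, word_b):
    """Cardinality of the multiset intersection of the two right contexts."""
    # Counter & Counter keeps the minimum count of each shared token.
    shared = right_context(tokens, word_a) & right_context(tokens, word_b)
    return sum(shared.values())
```

In the approach the abstract describes, this observed value would then be compared with the value expected for subsets of the same size drawn from the text.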
Czech and Slovak Linguistic Review 1/2012, 2012
The paper defines and shows how to use the Minimal Ratio, an exact metric that expresses the ratio between the measured value and the limits of the confidence interval calculated according to the formula on which Fisher’s exact test is based. The metric is meant to assist with keyword and collocation extraction and with comparing texts or corpora according to the distribution of word types or other similar criteria.
This contribution deals with the use of quotations (repeated n-grams) in the works of medieval Arabic literature. The analysis is based on a historical corpus of Arabic comprising 420 million words. Based on quotations repeated from work to work, a network is constructed and used for the interpretation of various aspects of Arabic literature. Two short case studies are presented, concentrating on the centrality and relevance of individual works, and on the analysis of the time depth and resulting impact of a given work in various periods.
Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives
Restricted collocability has received some attention, but not as a formalized method. We suggest that it should be used as a metric for collocations, as well as for other types of usage, both in linguistics and outside it, as it has great potential for a plethora of applications. Using examples from a diachronic corpus of Arabic, we show how it can be employed in studying prepositional valency and lexical profiling.
Software by Jiří Milička
Software for measuring the type-token relation, the hapax-token relation and other similar types of relations in a given text. The data is then modeled by the combinatorial model derived from the distribution of types in the text.
The software is designed to help discover text inhomogeneities by comparing the type-token relation of the text with its combinatorial model. Parts of a text in which the number of types rises disproportionately are marked. A quick increase (i.e. a new topic is introduced, or the style or language is changed) is marked in green, while a slow increase of types (i.e. repetition of old topics or even autoquotations) is marked in red. The software is also suitable for literary studies.
The application allows its user to change the direction in which the text is processed, forwards or backwards. When checking both forwards and backwards, unique parts of the text (compared with the rest of the text) are marked in green, while typical parts are marked in red. The freeware application provides a graphical user interface.
Václav Cvrček: Kvantitativní analýza kontextu. Praha: Nakladatelství Lidové noviny, 2013. 288 s.